[DISCUSS] Star Schema Benchmark on Druid

Dear Druid team:
I am new to Druid and want to understand its performance, so I am using SSB (the Star Schema Benchmark) to measure average response times and compare Druid with other OLAP / ad-hoc SQL engines (such as ClickHouse, Doris, Kylin, and SparkSQL). I am running a recent version of Druid (0.15) on a CDH 5.7 cluster. The cluster has two worker nodes, each with 10 cores and 60 GB RAM, and disk space is sufficient. Druid's MiddleManager and Historical processes are deployed on the two worker nodes.

As far as I know, Druid SQL does not support joins, so we need to use the LOOKUP function instead. Below is what I did. Could you please help review it and check whether the approach is reasonable and whether all Druid processes are configured properly?

I would be glad to hear your suggestions. Thank you.

Definition of the index task and the lookup:

{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "SSB_12",
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "tsv",
          "delimiter": "|",
          "columns": [
            "lo_orderkey",
            "lo_linenumber",
            "lo_custkey",
            "lo_partkey",
            "lo_suppkey",
            "lo_orderdate",
            "lo_orderpriority",
            "lo_shippriority",
            "lo_quantity",
            "lo_extendedprice",
            "lo_ordtotalprice",
            "lo_discount",
            "lo_revenue",
            "lo_supplycost",
            "lo_tax",
            "lo_commitdate",
            "lo_shipmode"
          ],
          "timestampSpec": {
            "column": "lo_orderdate",
            "format": "yyyyMMdd"
          },
          "dimensionsSpec": {
            "dimensions": [
              "lo_custkey",
              "lo_orderdate",
              "lo_partkey",
              "lo_suppkey",
              "lo_discount",
              "lo_quantity"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [
        {
          "type": "count",
          "name": "count"
        },
        {
          "type": "longSum",
          "name": "lo_revenue",
          "fieldName": "lo_revenue"
        },
        {
          "type": "longSum",
          "name": "lo_supplycost",
          "fieldName": "lo_supplycost"
        }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "YEAR",
        "queryGranularity": "DAY",
        "rollup": true,
        "intervals": [
          "1992-01-01/1998-08-02"
        ]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "hdfs://cdh-master:8020/user/root/ssb_12/data/lineorder/lineorder.tbl.1,hdfs://cdh-master:8020/user/root/ssb_12/data/lineorder/lineorder.tbl.2,hdfs://cdh-master:8020/user/root/ssb_12/data/lineorder/lineorder.tbl.3,hdfs://cdh-master:8020/user/root/ssb_12/data/lineorder/lineorder.tbl.4,hdfs://cdh-master:8020/user/root/ssb_12/data/lineorder/lineorder.tbl.5,hdfs://cdh-master:8020/user/root/ssb_12/data/lineorder/lineorder.tbl.6,hdfs://cdh-master:8020/user/root/ssb_12/data/lineorder/lineorder.tbl.7,hdfs://cdh-master:8020/user/root/ssb_12/data/lineorder/lineorder.tbl.8,hdfs://cdh-master:8020/user/root/ssb_12/data/lineorder/lineorder.tbl.9,hdfs://cdh-master:8020/user/root/ssb_12/data/lineorder/lineorder.tbl.10,hdfs://cdh-master:8020/user/root/ssb_12/data/lineorder/lineorder.tbl.11,hdfs://cdh-master:8020/user/root/ssb_12/data/lineorder/lineorder.tbl.12"
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "jobProperties": {
        "mapreduce.job.classloader": "true",
        "mapreduce.job.user.classpath.first": "true",
        "mapreduce.reduce.java.opts": "-Xmx7g",
        "mapreduce.reduce.memory.mb": "8192",
        "mapreduce.map.memory.mb": "8192",
        "mapreduce.map.java.opts": "-Xmx7g"
      }
    }
  }
}
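One detail worth double-checking in the timestampSpec: Druid uses Joda-style date patterns, in which mm means minute-of-hour and MM means month, so an lo_orderdate value like 19930105 needs the pattern yyyyMMdd, not yyyymmdd. A minimal sketch of the equivalent parse in Python (the sample date value is made up):

```python
from datetime import datetime

# Joda "yyyyMMdd" corresponds to strptime "%Y%m%d":
# %Y = 4-digit year, %m = month-of-year, %d = day-of-month.
raw = "19930105"  # hypothetical lo_orderdate value
ts = datetime.strptime(raw, "%Y%m%d")
print(ts.year, ts.month, ts.day)  # -> 1993 1 5
```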

{
  "type": "cachedNamespace",
  "extractionNamespace": {
    "type": "uri",
    "uri": "hdfs://cdh-master:8020/user/root/ssb_5/data/part/part.tbl",
    "namespaceParseSpec": {
      "format": "tsv",
      "columns": [
        "p_partkey",
        "p_name",
        "p_mfgr",
        "p_category",
        "p_brand",
        "p_type",
        "p_size",
        "p_container"
      ],
      "keyColumn": "p_partkey",
      "valueColumn": "p_mfgr",
      "delimiter": "|"
    },
    "pollPeriod": "PT5M"
  },
  "firstCacheTimeout": 120000,
  "injective": false
}

(Note: "injective" is set to false here because many parts share one p_mfgr value, so the p_partkey -> p_mfgr mapping is not one-to-one; marking a many-to-one lookup as injective can produce incorrect results.)
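As a sanity check on the lookup definition, here is a small sketch of what a cachedNamespace URI lookup does: it builds a key-to-value map from keyColumn to valueColumn over the delimited file, and LOOKUP(key, 'lookup_name') is then just a map get (missing keys return null). The sample part.tbl rows below are made up; only the column order matches the spec above.

```python
# Build a p_partkey -> p_mfgr map the way the uri lookup does, then
# apply it the way LOOKUP(lo_partkey, '<lookup name>') would at query time.
def build_lookup(lines, key_idx=0, value_idx=2, delimiter="|"):
    table = {}
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        table[fields[key_idx]] = fields[value_idx]
    return table

# Hypothetical part.tbl rows: p_partkey|p_name|p_mfgr|p_category|p_brand|...
rows = [
    "1|goldenrod lace|MFGR#1|MFGR#11|MFGR#1121|PROMO|7|JUMBO",
    "2|blush rosy|MFGR#2|MFGR#23|MFGR#2325|ECONOMY|1|LG CASE",
]
lookup = build_lookup(rows)
print(lookup.get("1"))    # -> MFGR#1
print(lookup.get("999"))  # missing key -> None (Druid returns null)
```

Separate lookups would be needed for each value column used by the queries (p_category, p_brand, and so on), since one lookup maps a key to a single value.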

Rewritten SSB SQL (Druid version)

  • query 1.1

select sum(lo_revenue) as revenue

from SSB_12

where LOOKUP(lo_orderdate, 'd_year') = 1993

and lo_discount between 1 and 3

and lo_quantity < 25;

  • query 1.2

select sum(lo_revenue) as revenue

from SSB_12

where LOOKUP(lo_orderdate, 'd_yearmonthnum') = 199401

and lo_discount between 4 and 6

and lo_quantity between 26 and 35

  • query 1.3

select sum(lo_revenue) as revenue

from SSB_12

where  LOOKUP(lo_orderdate, 'd_weeknuminyear') = 6 and  LOOKUP(lo_orderdate, 'd_year') = 1994

and lo_discount between 5 and 7

and lo_quantity between 26 and 35;

  • query 2.1

select sum(lo_revenue) as lo_revenue, LOOKUP(lo_orderdate, 'd_year'), LOOKUP(lo_partkey, 'p_brand')

from SSB_12

where LOOKUP(lo_partkey,'p_category') = 'MFGR#12' and  LOOKUP(lo_suppkey,'s_region') = 'AMERICA'

group by LOOKUP(lo_orderdate, 'd_year'), LOOKUP(lo_partkey, 'p_brand')

order by LOOKUP(lo_orderdate, 'd_year'), LOOKUP(lo_partkey, 'p_brand')

  • query 2.2

select sum(lo_revenue) as lo_revenue,  LOOKUP(lo_orderdate, 'd_year'), LOOKUP(lo_partkey, 'p_brand')

from SSB_12

where LOOKUP(lo_partkey, 'p_brand') between 'MFGR#2221' and 'MFGR#2228' and  LOOKUP(lo_suppkey,'s_region')  = 'ASIA'

group by LOOKUP(lo_orderdate, 'd_year'), LOOKUP(lo_partkey, 'p_brand')

order by LOOKUP(lo_orderdate, 'd_year'), LOOKUP(lo_partkey, 'p_brand')

  • query 2.3

select sum(lo_revenue) as lo_revenue, LOOKUP(lo_orderdate, 'd_year'), LOOKUP(lo_partkey, 'p_brand')

from SSB_12

where LOOKUP(lo_partkey, 'p_brand')  = 'MFGR#2239' and LOOKUP(lo_suppkey,'s_region') = 'EUROPE'

group by LOOKUP(lo_orderdate, 'd_year'), LOOKUP(lo_partkey, 'p_brand')

order by LOOKUP(lo_orderdate, 'd_year'), LOOKUP(lo_partkey, 'p_brand')

  • query 3.1

select LOOKUP(lo_custkey,'c_nation'), LOOKUP(lo_suppkey,'s_nation'), LOOKUP(lo_orderdate, 'd_year'), sum(lo_revenue) as lo_revenue

from SSB_12

where LOOKUP(lo_custkey,'c_region') = 'ASIA' and LOOKUP(lo_suppkey,'s_region') = 'ASIA' and LOOKUP(lo_orderdate, 'd_year') >= 1992 and LOOKUP(lo_orderdate, 'd_year') <= 1997

group by LOOKUP(lo_custkey,'c_nation'), LOOKUP(lo_suppkey,'s_nation'), LOOKUP(lo_orderdate, 'd_year')

order by LOOKUP(lo_orderdate, 'd_year') asc, lo_revenue desc

  • query 3.2

select LOOKUP(lo_custkey,'c_city'), LOOKUP(lo_suppkey,'s_city'), LOOKUP(lo_orderdate, 'd_year'), sum(lo_revenue) as lo_revenue

from SSB_12

where LOOKUP(lo_custkey,'c_nation') = 'UNITED STATES' and LOOKUP(lo_suppkey,'s_nation') = 'UNITED STATES'

and LOOKUP(lo_orderdate, 'd_year') >= 1992 and LOOKUP(lo_orderdate, 'd_year') <= 1997

group by LOOKUP(lo_custkey,'c_city'), LOOKUP(lo_suppkey,'s_city'), LOOKUP(lo_orderdate, 'd_year')

order by LOOKUP(lo_orderdate, 'd_year') asc, lo_revenue desc

  • query 3.3

select LOOKUP(lo_custkey,'c_city'), LOOKUP(lo_suppkey,'s_city'), LOOKUP(lo_orderdate, 'd_year'), sum(lo_revenue) as lo_revenue

from SSB_12

where (LOOKUP(lo_custkey,'c_city')='UNITED KI1' or LOOKUP(lo_custkey,'c_city')='UNITED KI5')

and (LOOKUP(lo_suppkey,'s_city')='UNITED KI1' or LOOKUP(lo_suppkey,'s_city')='UNITED KI5')

and LOOKUP(lo_orderdate, 'd_year') >= 1992 and LOOKUP(lo_orderdate, 'd_year') <= 1997

group by LOOKUP(lo_custkey,'c_city'), LOOKUP(lo_suppkey,'s_city'), LOOKUP(lo_orderdate, 'd_year')

order by LOOKUP(lo_orderdate, 'd_year') asc, lo_revenue desc

  • query 3.4

select LOOKUP(lo_custkey,'c_city'), LOOKUP(lo_suppkey,'s_city'), LOOKUP(lo_orderdate, 'd_year'), sum(lo_revenue) as lo_revenue

from SSB_12

where (LOOKUP(lo_custkey,'c_city')='UNITED KI1' or LOOKUP(lo_custkey,'c_city')='UNITED KI5') and (LOOKUP(lo_suppkey,'s_city')='UNITED KI1' or LOOKUP(lo_suppkey,'s_city')='UNITED KI5') and LOOKUP(lo_orderdate, 'd_yearmonth') = 'Dec1997'

group by LOOKUP(lo_custkey,'c_city'), LOOKUP(lo_suppkey,'s_city'), LOOKUP(lo_orderdate, 'd_year')

order by LOOKUP(lo_orderdate, 'd_year') asc, lo_revenue desc

  • query 4.1

select LOOKUP(lo_orderdate, 'd_year') as d_year, LOOKUP(lo_custkey, 'c_nation') as c_nation, sum(lo_revenue) - sum(lo_supplycost) as profit

from SSB_12

where LOOKUP(lo_custkey, 'c_region') = 'AMERICA'

and LOOKUP(lo_suppkey, 's_region') = 'AMERICA'

and (LOOKUP(lo_partkey, 'p_mfgr') = 'MFGR#1' or LOOKUP(lo_partkey, 'p_mfgr') = 'MFGR#2')

group by LOOKUP(lo_orderdate, 'd_year'),  LOOKUP(lo_custkey, 'c_nation')

order by LOOKUP(lo_orderdate, 'd_year'),  LOOKUP(lo_custkey, 'c_nation')

  • query 4.2

select LOOKUP(lo_orderdate, 'd_year'), LOOKUP(lo_suppkey,'s_nation'), LOOKUP(lo_partkey, 'p_category'), sum(lo_revenue) - sum(lo_supplycost) as profit

from SSB_12

where LOOKUP(lo_custkey,'c_region') = 'AMERICA' and LOOKUP(lo_suppkey,'s_region') = 'AMERICA'

and (LOOKUP(lo_orderdate, 'd_year') = 1997 or LOOKUP(lo_orderdate, 'd_year') = 1998)

and (LOOKUP(lo_partkey, 'p_mfgr') = 'MFGR#1' or LOOKUP(lo_partkey, 'p_mfgr') = 'MFGR#2')

group by LOOKUP(lo_orderdate, 'd_year'), LOOKUP(lo_suppkey,'s_nation'), LOOKUP(lo_partkey, 'p_category')

order by LOOKUP(lo_orderdate, 'd_year'), LOOKUP(lo_suppkey,'s_nation'), LOOKUP(lo_partkey, 'p_category')

  • query 4.3

select LOOKUP(lo_orderdate, 'd_year'), LOOKUP(lo_suppkey,'s_city'), LOOKUP(lo_partkey, 'p_brand'), sum(lo_revenue) - sum(lo_supplycost) as profit

from SSB_12

where LOOKUP(lo_custkey,'c_region') = 'AMERICA' and LOOKUP(lo_custkey,'c_nation') = 'UNITED STATES'

and (LOOKUP(lo_orderdate, 'd_year') = 1997 or LOOKUP(lo_orderdate, 'd_year') = 1998)

and LOOKUP(lo_partkey, 'p_category') = 'MFGR#14'

group by LOOKUP(lo_orderdate, 'd_year'), LOOKUP(lo_suppkey,'s_city'), LOOKUP(lo_partkey, 'p_brand')

order by LOOKUP(lo_orderdate, 'd_year'), LOOKUP(lo_suppkey,'s_city'), LOOKUP(lo_partkey, 'p_brand')

Config of Historical

druid.service=druid/historical

druid.plaintextPort=8083

# HTTP server threads
druid.server.http.numThreads=60

# Processing threads and buffers
druid.processing.buffer.sizeBytes=500000000
druid.processing.numMergeBuffers=2
druid.processing.numThreads=8
druid.processing.tmpDir=var/druid/processing

# Segment storage
druid.segmentCache.locations=[{"path":"var/druid/segment-cache","maxSize":600000000000}]
druid.server.maxSize=600000000000

# Query cache
druid.historical.cache.useCache=true
druid.historical.cache.populateCache=true
druid.cache.type=caffeine
druid.cache.sizeInBytes=750000000

JVM config of Historical

-server

-Xms10g

-Xmx20g

-XX:+UseG1GC

-XX:MaxDirectMemorySize=40g

-XX:+ExitOnOutOfMemoryError

-XX:+PrintGCDetails

-XX:+PrintGCDateStamps

-XX:+PrintGCTimeStamps

-XX:+PrintGCApplicationStoppedTime

-XX:+PrintGCApplicationConcurrentTime

-Xloggc:/var/logs/druid/historical.gc.log

-XX:+UseGCLogFileRotation

-XX:NumberOfGCLogFiles=50

-XX:GCLogFileSize=10m

-Duser.timezone=UTC

-Dfile.encoding=UTF-8

-Djava.io.tmpdir=var/tmp

-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
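As a sanity check on the Historical memory settings: the Druid docs size direct memory as (druid.processing.numThreads + druid.processing.numMergeBuffers + 1) * druid.processing.buffer.sizeBytes. Plugging in the values above shows that -XX:MaxDirectMemorySize=40g leaves plenty of headroom:

```python
# Direct memory needed by the Historical, using the runtime.properties above.
buffer_size_bytes = 500_000_000  # druid.processing.buffer.sizeBytes
num_threads = 8                  # druid.processing.numThreads
num_merge_buffers = 2            # druid.processing.numMergeBuffers
max_direct = 40 * 1024**3        # -XX:MaxDirectMemorySize=40g

required = (num_threads + num_merge_buffers + 1) * buffer_size_bytes
print(required)                # -> 5500000000, i.e. about 5.5 GB
print(required <= max_direct)  # -> True, so 40g is more than enough
```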

Result (response time, in seconds)

Query ID | Return time (1) | Return time (2) | Return time (3) | Kylin (Exact Match)
---------|-----------------|-----------------|-----------------|--------------------
q1.1     | 2.40            | 1.79            | 2.00            | 0.31
q1.2     | 1.19            | 0.74            | 0.89            | 0.34
q1.3     | 1.40            | 1.02            | 1.33            | 0.32
q2.1     | 4.12            | 3.19            | 3.40            | 0.48
q2.2     | 3.23            | 3.02            | 2.97            | 0.40
q2.3     | 2.50            | 2.37            | 2.30            | 0.29
q3.1     | 5.93            | 4.85            | 5.20            | 0.36
q3.2     | 2.09            | 1.68            | 1.76            | 0.36
q3.3     | 1.27            | 1.06            | 1.41            | 0.20
q3.4     | 0.48            | 0.30            | 0.34            |
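Since the goal stated at the top of the thread is the average response time per query, the three runs in the table can be averaged directly; a quick computation over the numbers above (in seconds):

```python
# Average the three Druid runs for each SSB query in the table above.
runs = {
    "q1.1": (2.40, 1.79, 2.00), "q1.2": (1.19, 0.74, 0.89),
    "q1.3": (1.40, 1.02, 1.33), "q2.1": (4.12, 3.19, 3.40),
    "q2.2": (3.23, 3.02, 2.97), "q2.3": (2.50, 2.37, 2.30),
    "q3.1": (5.93, 4.85, 5.20), "q3.2": (2.09, 1.68, 1.76),
    "q3.3": (1.27, 1.06, 1.41), "q3.4": (0.48, 0.30, 0.34),
}
averages = {q: round(sum(t) / len(t), 2) for q, t in runs.items()}
for q, avg in sorted(averages.items()):
    print(q, avg)
```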

Hi,

We have done similar benchmarking with SSB, but instead of LOOKUP we chose to denormalize the data during ingestion and then ran the queries over the denormalized data at 1 TB scale.

The results can be found here -

https://blog.cloudera.com/benchmark-update-apache-hive-druid-integration-hdp-3-0/

The code we used to run the benchmarks can be found here -

https://github.com/cartershanklin/hive-druid-ssb
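For anyone following along: "denormalize the data during ingestion" means joining the fact table with its dimension tables once, up front, so each ingested row already carries attributes like p_brand and no query-time LOOKUP is needed. A minimal sketch of that pre-join (the field names follow SSB; the sample rows are made up):

```python
# Pre-join lineorder with the part dimension so the denormalized rows
# already carry p_brand / p_category (no query-time LOOKUP needed).
part = {  # p_partkey -> part attributes (hypothetical rows)
    "1": {"p_brand": "MFGR#2221", "p_category": "MFGR#22"},
    "2": {"p_brand": "MFGR#1110", "p_category": "MFGR#11"},
}
lineorder = [  # hypothetical fact rows
    {"lo_orderkey": "100", "lo_partkey": "1", "lo_revenue": 5000},
    {"lo_orderkey": "101", "lo_partkey": "2", "lo_revenue": 7000},
]
denormalized = [{**row, **part[row["lo_partkey"]]} for row in lineorder]
print(denormalized[0]["p_brand"])  # -> MFGR#2221
```

The trade-off is the one raised later in the thread: with a flat table, a dimension change means re-indexing the affected intervals, while a lookup only needs its source file refreshed.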

Hi Nishant,

Thank you for your reply. I have read your article. Since our cluster does not meet your requirements (we are on CDH 5.7 with Hive 1.x), we cannot use your code.

And I have a question about "denormalize the data during ingestion": do you mean joining the fact table with the dimension tables into one larger flat table? If so, is that reasonable? I cannot update dimension data easily that way; in my mind, updating dimension data through a lookup is much cheaper than re-indexing.

If I have misunderstood you, please let me know. Thank you.

On Wednesday, August 28, 2019 at 11:12:01 AM UTC+8, Nishant Bangarwa wrote:

Hi Xiaoxiang,

I have done a functionality comparison between Druid and Kylin, but not PSR testing.

A few key functionalities are not supported in Kylin (last checked against version 2.6.3). For example:

– query 2.1

select sum(lo_revenue) as lo_revenue, LOOKUP(lo_orderdate, 'd_year'), LOOKUP(lo_partkey, 'p_brand')

from SSB_12

where LOOKUP(lo_partkey,'p_category') = 'MFGR#12' and LOOKUP(lo_suppkey,'s_region') = 'AMERICA' and lo_quantity < 25

group by LOOKUP(lo_orderdate, 'd_year'), LOOKUP(lo_partkey, 'p_brand')

order by LOOKUP(lo_orderdate, 'd_year'), LOOKUP(lo_partkey, 'p_brand')

Because Kylin pre-aggregates the data, it is not possible to filter on the most granular data and then aggregate at higher dimension levels (e.g., category -> brand -> sku).

Another limitation is that the cube definition is fixed, so the query patterns have to be known in advance. This defeats the purpose of "self-service BI".

One more point: I think you can avoid one lookup by replacing LOOKUP(lo_orderdate, 'd_year') with time_extract(lo_orderdate, 'year').
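To illustrate this suggestion: since lo_orderdate is the datasource's timestamp column, the year can be computed from the timestamp itself instead of going through a lookup table. A small Python sketch of the equivalence (the lookup entries are hypothetical):

```python
from datetime import datetime

# LOOKUP(lo_orderdate, 'd_year') consults a lookup table, while a
# time-extract function can read the year straight off the timestamp.
d_year = {"19930215": 1993, "19940301": 1994}  # hypothetical lookup entries

def year_from_time(raw_orderdate):
    # What extracting YEAR from the parsed timestamp would return.
    return datetime.strptime(raw_orderdate, "%Y%m%d").year

same = all(year_from_time(k) == v for k, v in d_year.items())
print(same)  # -> True: the same answer, with one fewer lookup
```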

Regards, Chari.

As mentioned in the Druid docs: Druid schema design.

Thank you for your suggestion. I will do another test and share my results as soon as possible.

On Tuesday, September 3, 2019 at 3:33:09 PM UTC+8, Bin Li wrote:

Thank you for your advice. I will do another test and update my results as soon as possible.

On Thursday, August 29, 2019 at 7:30:01 PM UTC+8, Lakshminarayana Chari wrote: