Druid configurations to optimize TopN queries

I have a 2-node Druid setup where each node has 64 GB RAM, 6 TB of disk, and 40 CPU cores.
I need to configure Druid for the best performance of TopN queries over different types of aggregations, namely longSum and hyperUnique.

With the current setup, a normal TopN query over a single-valued dimension with a longSum aggregation takes 20+ seconds when fired for the first time and approximately 2 seconds when queried repeatedly.

On the historical node, the “query/wait/time” metric has a value of 17935 when queried for the first time.

A TopN query over a single-valued dimension but with a hyperUnique aggregation takes around 40 seconds when fired the first time and around 10 seconds on repeated querying.
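For reference, the shape of the TopN queries being discussed is roughly the following; the datasource, dimension, and field names are illustrative placeholders rather than the actual schema, and the hyperUnique aggregator assumes a hyperUnique metric column defined at ingestion time:

{
  "queryType": "topN",
  "dataSource": "my_datasource",
  "dimension": "my_dimension",
  "metric": "total_count",
  "threshold": 50,
  "granularity": "all",
  "intervals": ["2015-08-01/2015-08-02"],
  "aggregations": [
    { "type": "longSum", "name": "total_count", "fieldName": "count" },
    { "type": "hyperUnique", "name": "unique_users", "fieldName": "user_id_hll" }
  ]
}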

The JVM options passed to the historical node are "-Xmx4g -Xms1g -XX:MaxNewSize=2g", and I also verified that the load average is not high on these nodes during the query.

The other relevant configurations of the historical node are:

druid.historical.cache.useCache=true

druid.historical.cache.populateCache=true

druid.processing.buffer.sizeBytes=262144000

druid.processing.numThreads=39

druid.server.http.numThreads=50

druid.server.maxSize=300000000000

The broker configurations are:

-Xmx4g -Xms2g -XX:NewSize=1g -XX:MaxNewSize=2g -XX:MaxDirectMemorySize=64g

druid.broker.http.numConnections=20

druid.broker.http.readTimeout=PT5M

druid.processing.buffer.sizeBytes=2147483647

druid.processing.numThreads=31

druid.server.http.numThreads=50

I also observed that in the broker metrics, “query/node/ttfb” is almost the same as the query time, hence I believe most of the query time is spent on the historical nodes.

The datasource I am querying has 933 segments of ~260 MB each, and since the query has no filters it applies to the entire dataset.

I wanted to understand what could be the reason for such a high wait time.

Also, which configurations will help me optimize the cluster for TopN queries?

Hi Rohit,

If your number of processing threads is equal to your number of cores, then your load average, even under disk-I/O-bound conditions, shouldn’t be silly high (as long as there are no other services running).

Have you checked what the CPU usage breakdown is during the long 40-second queries? I wouldn’t be surprised if your CPU is mostly I/O wait during that time, then mostly user time during subsequent queries.

If that is the case it simply means that the page cache has not warmed up for the data files.

Also note that the JVM itself has to do optimizations when the server first starts. So your first few queries might be slow as all the JIT stuff gets optimized.

Part of what you are encountering is balancing the CPU-to-memory-to-disk ratio.

Hello Charles,
I analysed the CPU usage of various TopN queries running over different periods of data with different kinds of aggregations.

My observations are summarized in the table below:

Type of query | Time taken | Total segment size for queried interval | CPU idle (%) | CPU wait (%) | CPU usr (%) | Page in | Page out | Remarks
--- | --- | --- | --- | --- | --- | --- | --- | ---
Top 50 with only longSum aggregation | 19 sec | 245 GB | 30 | 21 | 48 | 0 | 0 |
Top 50 with hyperUnique aggregation | 1 min 46 sec | 245 GB | 20 | 0 | 80 | 0 | 0 | Query was on the same dimension as the previous one, hence the page cache must already have been warm
Previous query fired again | 4.7 sec | 245 GB | 97 | 0 | 3 | 0 | 0 |
Top 50 with both hyperUnique and longSum aggregations | 50 sec | 245 GB | 21 | 0 | 78 | 0 | 0 | 1 historical node was heavily loaded on CPU; the other node was totally idle
Top 50 with both hyperUnique and longSum aggregations | No result even after 5 minutes | 529 GB | 70 | 20 | 6 | Minimal (4096 KB) | 0 | 1 historical node returned in 89 sec; the other never returned, and I verified that full GC was not happening on the non-responding node
Top 50 with only longSum aggregation | 27 sec | 529 GB | 52 | 34 | 14 | 550 KB | 6000 KB |

These are a few questions based on this experiment:

  • I had a 2-node historical setup and 2 segments, each approximately 250 GB in size and having approximately 1000 shards each; whenever a query was run on a single segment it used only one of the historical nodes, which makes me believe that one segment is always held completely by one historical node irrespective of the number of shards it has, and is not sharded across historical nodes. Is that true?
  • HyperUnique queries took a very long time although the data was already in the page cache and there was still some idle CPU at all points in time. What configuration should be changed to make these queries faster?
  • Queries on my Druid setup are in general running slowly even though there isn’t any paging in/out happening and not much CPU is waiting on I/O. What could be the reason for this?
Any suggestions on what the next steps could be to debug this further?

Thanks

Rohit

Cool, thanks for the data.

HyperUniques are indeed “slow” (they are still the fastest cardinality estimator we’ve found so far) and one of the areas where any improvements in speed make a significant impact on overall cluster performance.

If you look at the druid console on your coordinator, does it show an even distribution of segments across the two historical nodes?

If you have 80 virtual cores and 1000 x 250 MB shards, the cluster is going to start at about 13 segments deep per core (effectively). Even at a segment scan time of 1 second, the query should ideally return on the order of 13 seconds. (The segment scan times should be closer to a few hundred ms, making the total time below 10 seconds, assuming the data is paged into memory.)
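(Spelled out as arithmetic: 1000 segments / 80 cores ≈ 13 scan rounds per core, so roughly 13 x 1 s ≈ 13 s at one second per segment, or about 13 x 0.3 s ≈ 4 s if segments scan in a few hundred milliseconds.)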

If you’re wanting to debug speed issues, you’ll probably want to turn off caching while you’re doing tuning because caching makes stuff waaaaay faster but less predictable (Stuff can get evicted from cache and screw up your benchmarks)

I find it odd that the query with both hyperUnique and longSum aggregations didn’t return.

If you turn off caching and try the poorly performing queries again, and still get no I/O wait when you repeatedly query, then you can probably handle a larger heap setting. Did you collect any metrics on total GC time during the different queries?

Also, as a sanity check, you’ll want to warm up the JVM itself with a few queries before taking measurements in order for the JIT to kick in.

Hopefully that helps,

Charles Allen

Thanks, Charles, for your response.

As per your suggestion, I turned off the cache on both the historical and broker nodes before further experiments.

This time I made sure that data was uniformly distributed across the historical nodes, and hence didn’t observe any bias in the CPU load across the 2 nodes.

I also made another change in my setup: earlier I had only one “druid.segmentCache.locations” entry defined for the historical node, mounted on 1 of the 6 available disks, so the other disks were not getting used. During the new experiments I defined 6 “druid.segmentCache.locations” entries, one mounted on each disk.
The new config reduced the iowait drastically; earlier it was on the order of 30-40%, now during the same queries iowait is just 3-4%, but I still don’t see any improvement in the query time.
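For reference, a multi-disk segment cache configuration of that kind looks roughly like the following; the mount paths and maxSize values are illustrative placeholders rather than the exact values from this cluster (the usual guidance is that druid.server.maxSize should not exceed the sum of the per-location maxSize values):

druid.segmentCache.locations=[{"path":"/mnt/disk1/druid/segmentCache","maxSize":50000000000},{"path":"/mnt/disk2/druid/segmentCache","maxSize":50000000000},{"path":"/mnt/disk3/druid/segmentCache","maxSize":50000000000},{"path":"/mnt/disk4/druid/segmentCache","maxSize":50000000000},{"path":"/mnt/disk5/druid/segmentCache","maxSize":50000000000},{"path":"/mnt/disk6/druid/segmentCache","maxSize":50000000000}]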

One thing that I observed is the “query/segmentAndCache/time” metric: during the faster queries it is on the order of 1-2 seconds, whereas during the slow queries (the ones with hyperUnique) this time is 23 seconds on average.

What could be the reason for such a high “segmentAndCache” time?

From ‘iostat’ I can observe that none of the disks is more than 70% utilised during query time.

Thanks

Rohit

Rohit, 1-2 seconds to scan a segment is extremely slow. I suspect things are misconfigured. Do you have caching turned on at the historical level? What is your segment/scan/time? How big are your segments?

Er, query/segment/time not segment/scan/time

Fangjin,
I have disabled the cache at the historical for now so that it doesn’t impact the performance analysis, but when we deploy this in production we will definitely enable the cache.
Each segment is approximately 250 MB, with each row having 18 dimensions and 6 metrics.
"query/segment/time” is almost equal to ""query/segment/time” which is equal to 2 sec for a simple aggregation but shoots to 20 sec + for hyper uniuqe aggregation.
Can you suggest which configurations can typically affect the segment scan time?

These are the relevant configurations of the historical node:

druid.historical.cache.useCache=false
druid.historical.cache.populateCache=false

druid.processing.buffer.sizeBytes=262144000
druid.processing.numThreads=30

druid.server.http.numThreads=50

Historical node has 40 CPUs.

Thanks
Rohit

Hi Rohit, 20 seconds for HLL over a 250 MB segment seems very slow. How many unique values are in the segment?

Can you share the time required for a simple timeseries query with a single count aggregator over 1 segment?
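For concreteness, a minimal timeseries query of that kind looks roughly like this (the datasource name and interval are placeholders):

{
  "queryType": "timeseries",
  "dataSource": "my_datasource",
  "granularity": "all",
  "intervals": ["2015-08-01/2015-08-02"],
  "aggregations": [
    { "type": "count", "name": "rows" }
  ]
}

The per-segment cost then shows up in the “query/segment/time” metric emitted by the historicals.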


Also, does increasing druid.processing.buffer.sizeBytes to 512MB help at all?
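For reference, 512 MB expressed in bytes for that property would be:

druid.processing.buffer.sizeBytes=536870912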

Hello Fangjin,
Can you help me with the syntax to run a query over just 1 segment?

As per the documentation, we can run a query over an interval only.

I did try to sniff through the Druid code and figured out a syntax, but it doesn’t seem to work, as it still selects all the segments belonging to the interval.

I assume you are querying the nodes via a Broker. Does that broker have populateCache enabled?

Another few questions, because that segment and cache time seems odd:

What is the disk read throughput you get using any common benchmarking methods?
Are the disk devices Local or Network disk?

Are there any other processes or services which could be accessing the disk at the same time?

Are the Disks SSD or spinning drives?

Hello Charles,

Please find the answers below :

Yes, I am using the broker to run these queries, but I have not set any cache-related property on either the broker or the historical nodes.
The read throughput comes to around 160 MB/s when measured using “hdparm”.

The disk devices are local.

These boxes are used exclusively by Druid, and during these experiments no data ingestion was happening either.

These are conventional spinning drives, not SSDs.

Syntax:

Hello Fangjin,
The syntax you have provided is to run a query over a given interval, whereas I was looking to run a query on only a single segment so that I can figure out the time taken to scan a single segment in isolation.

Here are the answers to the questions you asked:

Q. How many unique values are in the segment?

Ans: I am still figuring out a way to run the query over a single segment.

One day of data has around 2 billion unique values distributed across 1000 segments.

Q. Can you share the time required for a simple timeseries query with a single count aggregator over 1 segment?

Ans: I ran the timeseries query over the same 1 day of data with a count aggregator, and the query returned in 1.2 seconds; “query/segment/time” was around 70-100 ms, whereas “query/segment/time” in the case of HLL was 6700 ms.

Q. Also, does increasing druid.processing.buffer.sizeBytes to 512 MB help at all?

Ans: No visible difference in terms of query time.

Ah okay, that is my bad. I’m trying to get an understanding of your scan rate for events/second/core. I think if that is reasonable, it’ll rule out weird configuration we’ve missed and focus on the data itself.