Debugging Druid query slowness

Hi,

We run a mid-size Druid cluster (60 historicals, no realtime nodes) and most queries finish in under 2-3 seconds. However, one specific query consistently takes more than 10 seconds. The queries run against a 67.9GB datasource containing about one and a half years of data at day granularity, where each shard is 150MB-220MB and the total shard count is 445. I looked at the metrics and query/segment/time seemed too high (see graph below).

I have tried changing the historical node configuration to Xmx and Xms = 20480m and -XX:MaxDirectMemorySize=16384m, and disabling the broker cache as recommended for clusters with more than 20 nodes, which seemed to make performance worse.

I haven’t yet tried moving to r3.8XL instances, but I am willing to try that out.

Questions I have:

  • Any recommended changes to the configuration below?
  • I have local caching enabled and the Druid common config has druid.monitoring.monitors=["com.metamx.metrics.JvmMonitor", "io.druid.server.metrics.HistoricalMetricsMonitor", "io.druid.client.cache.CacheMonitor"], but I cannot find any caching metrics, and based on the performance I am fairly confident caching is not active at all. Is anything missing from my config that would cause caching to be disabled, or that is needed to enable cache metrics? (See the sketch after this list.)
  • Once caching is enabled, is the cache kept on-heap or off-heap?
  • Looking at the recommended configuration for r3.8XL, it seems like a lot of memory is left for the OS (for the page cache?). If the queries only change once a day, is it better to allocate more space to the cache?
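
For reference, this is what I believe the historical-side cache settings should look like on 0.9.1 if I read the docs right (the sizeInBytes value is just my current placeholder, not a recommendation):

druid.historical.cache.useCache=true
druid.historical.cache.populateCache=true
druid.cache.type=local
druid.cache.sizeInBytes=2000000000
druid.monitoring.monitors=["com.metamx.metrics.JvmMonitor", "io.druid.server.metrics.HistoricalMetricsMonitor", "io.druid.client.cache.CacheMonitor"]

With the CacheMonitor active I would expect cache metrics such as query/cache/total/hits and query/cache/total/misses to show up, but I don't see them.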

Cluster details:

Druid version 0.9.1.1

Hardware R3.2XL (8 vcores, 61GB memory, 160GB SSD)

60 historical nodes

2 broker nodes

Total cluster capacity: 9.43TB

Total cluster used: 3.58TB (35.7%)

Historical configuration:

HTTP server threads

druid.broker.http.numConnections=20

druid.server.http.numThreads=50

Processing threads and buffers

druid.processing.buffer.sizeBytes=100000000

druid.processing.numThreads=7

Query cache

druid.broker.cache.useCache=true

druid.broker.cache.populateCache=true

druid.cache.type=local

druid.cache.sizeInBytes=2000000000

Historical JVM configuration:

-server

-Xms3072m

-Xmx3072m

-XX:NewSize=512m

-XX:MaxNewSize=512m

-XX:MaxDirectMemorySize=8192m

-XX:+UseConcMarkSweepGC

-Duser.timezone=UTC

-Dfile.encoding=UTF-8

-Djava.io.tmpdir=var/tmp

-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager

Broker configuration:

druid.broker.http.numConnections=20

druid.server.http.numThreads=50

Processing threads and buffers

druid.processing.buffer.sizeBytes=100000000

druid.processing.numThreads=7

Query cache

druid.broker.cache.useCache=true

druid.broker.cache.populateCache=true

druid.cache.type=local

druid.cache.sizeInBytes=2000000000

Broker JVM configuration:

-server

-Xms12288m

-Xmx12288m

-XX:MaxDirectMemorySize=16384m

-XX:NewSize=1536m

-XX:MaxNewSize=1536m

-XX:+UseConcMarkSweepGC

-Duser.timezone=UTC

-Dfile.encoding=UTF-8

-Djava.io.tmpdir=var/tmp

-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager

The Query:

{
  "queryType": "groupBy",
  "descending": "false",
  "granularity": "all",
  "intervals": ["2015-10-10T00:00:00.000Z/2016-10-10T00:00:01.000Z"],
  "dataSource": "test",
  "aggregations": [
    { "type": "longSum", "name": "i", "fieldName": "i" },
    { "type": "longSum", "name": "q", "fieldName": "q" },
    { "type": "longSum", "name": "p", "fieldName": "p" }
  ],
  "dimensions": ["a", "b"]
}

Appreciate any response.

Thanks

Parth

Hi,
By default, groupBy queries are not cached.

Try explicitly setting the property below in the broker's runtime.properties:

druid.broker.cache.unCacheable=[]
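
With that override in place, the broker cache section would look something like this (useCache and populateCache are already in your config; the empty list just means no query type is excluded from caching):

druid.broker.cache.useCache=true
druid.broker.cache.populateCache=true
druid.broker.cache.unCacheable=[]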

Appreciate your reply. I will enable that, and I am guessing it will enable groupBy caching. That still does not explain why the first time around it takes upwards of 15 seconds to load the page. Do you see any configuration that needs modification? What is the reason for query/segment/time being so high?

Thanks

Parth

Hi,

First off: I’m not a tuning expert, so take the following with the appropriate level of scepticism…

Maybe you could try a few things.

1.) Seeing that you have set your direct memory to 8GB, I assume your intention is to have processing buffers of 1GB per core on the historicals, which is very reasonable. However, it seems to me that your setting for the processing buffers is missing a zero and is therefore currently 100MB:
druid.processing.buffer.sizeBytes=100000000

Some folks say that 1GB is too much and that 500MB might also do, but I would consider that fine-tuning. So for now, I'd set it to 1GB, as sketched below.
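
Concretely, point 1 would just mean adding the missing zero, e.g.:

druid.processing.buffer.sizeBytes=1000000000
druid.processing.numThreads=7

(If I remember correctly, the rule of thumb is roughly sizeBytes * (numThreads + 1) of direct memory, so 8 x ~1GB still fits under your current -XX:MaxDirectMemorySize=8192m.)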

2.) The heap memory you configured seems rather low. Most people recommend setting the heap and direct memory as low as you can so that the rest of the memory is available for memory-mapping segment files. While that is true, it is fine-tuning, so I would first try heap settings that are definitely sufficient and only experiment with lowering them as a second step.
On our cluster, we observed that segment scan times were uneven across nodes and also fluctuated over time, which seems to be a good indicator that heap memory is too low. I would suggest 1GB of heap per core while you test what CPU utilization and segment scan times you get. Since these settings reduce the memory available for memory-mapping segments, they might hurt your memory-to-disk ratio in production, so it would be best to test them outside of production first.

I am attaching some screenshots of metrics that demonstrate the effects we see when the heap memory is too low.

3.) You could try -Xms8g -Xmx8g -XX:NewSize=4g -XX:MaxNewSize=4g; whatever you set as the heap size, set half of it as the new size.

As your nodes have 61GB of RAM, you so far had 3GB of heap and 8GB of direct memory, leaving 50GB per node for memory mapping. With an additional 5GB of heap, you are still left with 45GB for memory mapping, so you are losing only 10%.
If these settings work well, you could try lowering the direct memory towards the 7GB mark, because only 7 of your 8 cores are involved in processing segments and 1GB per core should be enough for the processing buffers. Maybe you can eventually shave 1-2GB off your heap settings without penalty, but for now I wouldn't risk it. Putting the heap suggestion together with your existing flags, the historical JVM configuration might look like the sketch below.
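
For illustration, with the heap suggestions from points 2 and 3 applied and everything else kept as in your current setup, the historical JVM settings might look like this (treat the numbers as a starting point for testing, not a tuned recommendation):

-server
-Xms8g
-Xmx8g
-XX:NewSize=4g
-XX:MaxNewSize=4g
-XX:MaxDirectMemorySize=8192m
-XX:+UseConcMarkSweepGC
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-Djava.io.tmpdir=var/tmp
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager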

The same query was sent for different heap sizes. When the heap was too low, we observed those oscillating scan times. One thing to be cautious about: in the middle we had a heap size that produced even segment scan times, but then I ran a test that sends many different queries so that the caches fill up (this is the gap in the middle). After that, the segment scan times for the test queries began to oscillate again. So even if a freshly started cluster shows good metrics, the heap settings could still be too small.
On the right, we had 1GB per core configured, and this led to stable performance.

The following shows the segment scan times we observed in production. As you can see, the uneven distribution of segment scan times across nodes leads to some nodes having scan times of up to 5 or even 10 seconds. We haven’t updated our production system with the new memory settings yet, but in our test systems we observed, as shown above, that the issue is probably that after having been in operation for a while, a node doesn’t have enough heap to fully utilize all of its cores, hence the uneven distribution of scan times.

The following screenshot shows the CPU utilization for different heap sizes. With 1GB per core (the right-most section of the graph) the utilization is just right: too low is an issue, but too high also does not yield the best throughput.

With the right heap settings, the oscillating scan times in the first graph above reduce to something much nicer.
Below you see a sequence of six different types of queries sent sequentially: x queries of type 1, then x queries of type 2, and so on.

The results show that complex metrics (with filter expressions) and numeric sort order cause the biggest stress on memory and GC and lead to performance penalties. The implementation of numeric sorting looks quite wasteful: it receives string objects and instantiates BigDecimal objects from them, just to compare which of the two is numerically smaller.

The leftmost scan times are for alphanumeric sorting, the rightmost for numeric sorting with an otherwise identical query. As you can see, segment scan times with numeric sorting are four times slower, so there is huge potential for speed improvements if someone were to tune the numeric sort routines in Druid.