druid config affecting query performance?

Hi All,

Our druid query performance seems to be ok under light load but degrades fairly quickly when load is increased. I initially thought it was an issue with specific queries that filtered on certain dimensions (see here). But I also wanted to take a higher level look at my configs and get some feedback as to whether there are areas that can be optimized.

Here is our setup:

3 broker nodes. Each node has:

-cpu * 8

-mem: 61 gig

-storage: 160 gig

broker node config:

druid.cache.type=local

druid.cache.sizeInBytes=104857600

druid.broker.cache.useCache=true

druid.broker.cache.populateCache=true

druid.broker.cache.unCacheable=

druid.processing.numThreads=7

druid.processing.buffer.sizeBytes=2147483647

druid.broker.http.numConnections=15

druid.broker.http.readTimeout=PT5M

druid.server.http.numThreads=30

broker node mem config:

-Xmx24g

-Xms24g

-XX:NewSize=8g

-XX:MaxNewSize=8g

-XX:MaxDirectMemorySize=16g

2 historical nodes. Each node has:

-cpu * 8

-mem: 61 gig

-storage: 1.6 TB SSD

historical node config:

druid.cache.type=local

druid.cache.sizeInBytes=104857600

druid.historical.cache.useCache=true

druid.historical.cache.populateCache=true

druid.historical.cache.unCacheable=

druid.server.maxSize=1503238553600

druid.segmentCache.locations=[{"path":"/data/b/druid/segmentCache","maxSize":751619276800},{"path":"/data/c/druid/segmentCache","maxSize":751619276800}]

druid.processing.numThreads=7

druid.processing.buffer.sizeBytes=1073741824

druid.broker.cache.useCache=true

druid.broker.cache.populateCache=true

historical node mem config:

-Xmx2g

-Xms2g

-XX:MaxDirectMemorySize=8g

-XX:NewSize=1g

-XX:MaxNewSize=1g

Segment Info:

-Our datasource schema has 22 dimensions and 7 metrics.

-Queries are only using longSum metric aggregators.

-The coordinator says the datasource has 348 gig in cold storage (this is what’s stored on the historical nodes’ local disks, I assume?)

-No replication on the historical nodes

-Segment granularity is DAY

-Data source has a total of 517 shards in 129 intervals.

-We have a skew in our segment sizes: each day’s segment had previously been around 800 MB across two shards. A gradual increase in data over the last month now yields around 4 GB across 10 shards per day.

Query performance degrades more when the queries contain more dimension filters, or more values to filter on per dimension. The degradation is also gradual, slowly affecting a larger percentage of the queries being made, and when it hits, it is very pronounced:

Timeseries queries normally: 1-10 ms

TopN queries normally: 100-200 ms

Degraded queries (both timeseries/topN): up to 30 sec.

I am already trying to optimize the client calls to not put so much concurrent load on the broker, but are there other optimizations that can be made in terms of our configuration or setup? Please let me know if there is any additional information that may be helpful.

Thank you!

-James

I should add that the timing metric is measured as the roundtrip to the broker and historical node, and back. Initially I was unclear whether the bottleneck was at the broker or the historical node, but during periods of degradation there appears to be a spike in CPU usage on the historical nodes (~40% CPU usage).

Some common reasons for performance degradation are that there are too many pending segments (a lack of cores), or that there is too much swapping because not enough segments can be paged into memory at any given time.

If you can use the Druid metrics to determine where the bottleneck is, it’ll help a lot with recommendations of what to do.
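For reference, a minimal runtime.properties sketch for surfacing those metrics. This assumes the built-in logging emitter is acceptable (swap in your own emitter module if you already ship metrics elsewhere):

```properties
# Emit JVM and query metrics to the service log (minimal sketch)
druid.monitoring.monitors=["com.metamx.metrics.JvmMonitor"]
druid.emitter=logging
druid.emitter.logging.logLevel=info
```

With this in place, metrics such as query/segment/time and segment/scan/pending show up as JSON lines in the node’s log.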

Hey James,

At first glance those configs look reasonable. At least I don’t see anything awfully out of whack. So what I’d look at first is stuff like:

For the broker and historicals:

  • Are you getting a lot of time spent in GC? (the jvm/gc/time metric, or your favorite jvm monitoring tool)

For historicals in particular:

  • Are your CPUs maxed out? If so, that’s good and bad. Good because it means you’re making good use of your hardware. Bad because it means there isn’t an easy fix with your configs :). If not, then that probably means your processing thread pool should be bigger, or that a lot of time is actually being spent at the broker level (probably in merging) rather than the historical level.

  • Are you seeing a lot of disk reads? Generally Druid nodes that are able to put all their data in the page cache won’t read from disk at all during queries. Druid nodes that can’t fit their data in the page cache will read from disk a bit. The more disk reads, the worse off you are…

  • Are your segment scan times really high? query/segment/time should be a few hundred ms or less – if it’s much higher (close to a second or more), you are probably either reading from disk or hitting an inefficient case in the query processing. For this metric in particular it can help a lot to look not just at average, but also at high %iles like 95% and 99%.
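If you use the logging emitter, here is a quick sketch for eyeballing a high percentile of query/segment/time. The log path and JSON field names are assumptions based on the default logging emitter’s output; adjust for your setup:

```shell
# Pull query/segment/time values out of the metrics log and print the ~95th %ile
grep '"metric":"query/segment/time"' /var/log/druid/metrics.log \
  | sed 's/.*"value":\([0-9]*\).*/\1/' \
  | sort -n \
  | awk '{a[NR]=$1} END {if (NR) {i=int(NR*0.95); if (i<1) i=1; print "p95:", a[i]}}'
```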

Just FYI, local cache has some nasty performance problems under heavy load.

Related Issues:

I whipped up an extension: https://github.com/metamx/druid-cache-caffeine which uses caffeine for caching instead of the local map cache, but I haven’t been able to give it a full run-through at high load yet (soon hopefully); at that point I’ll add more docs as well. Also, the extension requires Java 8.

That’s the first thing that popped out at me when I looked at your config.

Memcached cache does not have this problem, fyi (but has its own set of network related issues).
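If you want to try memcached, a minimal config sketch (the host name here is a placeholder; also remove or override the local-cache properties so they don’t conflict):

```properties
druid.cache.type=memcached
druid.cache.hosts=memcached1.example.com:11211
druid.cache.expiration=2592000
druid.cache.memcachedPrefix=druid
```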

Other than that, it would be worth seeing if the math reported by the metrics makes sense: does query/cpu/time add up to roughly 8 * cpu% CPU-seconds per second for each (8-core) node? Does the segment scan time correlate more with IO wait time or with user time? etc…

Also note that https://github.com/druid-io/druid/pull/1809 fixes an issue where the ordering of processing was non-deterministic. In 0.9.0 there is a flag you can set to make the processing queue FIFO. Such a setting has more effect under heavy load, when a large number of segments need processing for multiple queries (check segment/scan/pending for hints here).

Thanks for all the input guys!

Fangjin I think you are right. After reviewing all the metrics, query/wait/time looked pretty horrific, so we think there are too many pending segments. Our thinking right now is that our historical nodes may not be optimal. We may swap them out for nodes with more processing power and less storage. We are also still working on reducing the number of calls from our frontend.

Just for clarification on my end: druid.processing.numThreads=7 is the number of threads that can be used to process segments, correct? What happens if we set this number beyond the number of cores available?

If you set that beyond the number of virtual cores available then you have over provisioned your CPUs. If you are mostly user-CPU bound (not IOWait bound) then it shouldn’t “hurt” anything in particular, but shouldn’t help to go too high (using 32 threads on an 8 vCore machine is probably going to make things more confusing than helpful, but using 9 threads on an 8 vCore machine might be fine). I would expect your segment scan times (and general wall-clock times) to have more variance since you aren’t “guaranteed” a cpu to do the processing.

Note that if you tune your GC to not do stop-the-world collections as much, during the non-stop-the-world collections, you’ll be competing with processing thread pool resources to accomplish the GC.

Setting a higher number of processing threads also requires a larger processing buffer pool (more memory for processing buffer).
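As a back-of-the-envelope sketch of that relationship, using the broker numbers from earlier in this thread and the pre-merge-buffer rule of thumb that direct memory should be at least (numThreads + 1) * buffer.sizeBytes:

```shell
# Rough direct-memory budget check (sketch; numbers from the broker config above)
BUFFER_BYTES=2147483647                  # druid.processing.buffer.sizeBytes
NUM_THREADS=7                            # druid.processing.numThreads
NEEDED=$(( (NUM_THREADS + 1) * BUFFER_BYTES ))
DIRECT=$(( 16 * 1024 * 1024 * 1024 ))    # -XX:MaxDirectMemorySize=16g
echo "needed=$NEEDED direct=$DIRECT"
if [ "$NEEDED" -le "$DIRECT" ]; then echo "fits"; else echo "increase MaxDirectMemorySize"; fi
```

Note that with these numbers the 16g direct-memory limit is only just big enough, so bumping numThreads would also mean bumping -XX:MaxDirectMemorySize.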

Check what kind of CPU is being used. If it's spending a lot of time in kernel space instead of user space, it could be a kernel-setting issue we ran into recently on RHEL 6.

Another symptom is a lot of time spent on spin locks in kernel space. I.e. if you do a "perf top" and see a bunch of events in "_spin_lock_irq" or "_spin_lock", this might be happening to you.

Specifically, it appears to be because of NUMA. RHEL 6 defaults to trying to move memory around in order to make it local to the core that is doing the processing, which ends up taking out a kernel-space lock and blocking other things from moving forward. If you have queries coming in as this is happening, that can cause things to back up and cause high wait times.

If it is indeed this issue, you can resolve it by adjusting your zone reclaim setting:

echo "0" > /proc/sys/vm/zone_reclaim_mode
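To make that survive a reboot, a sketch (the sysctl key is standard; the exact config file can vary by distro):

```shell
# Persist the NUMA zone-reclaim setting (run as root; path may vary by distro)
echo "vm.zone_reclaim_mode = 0" >> /etc/sysctl.conf
sysctl -p
# Verify the running value
cat /proc/sys/vm/zone_reclaim_mode
```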

--Eric

Hi James, if there are too many pending segments (verify with the segment/scan/pending metric), it can be caused by too much swapping as segments are paged in and out of memory (SSDs help a lot here), by single segment scans taking too long, or by too many concurrent queries relative to the number of available cores.

I have a question regarding the following statement:
“- Are your segment scan times really high? query/segment/time should be a few hundred ms or less – if it’s much higher (close to a second or more), you are probably either reading from disk or hitting an inefficient case in the query processing.”

In our cluster we observe that the segment scan times depend heavily on how many metrics are being queried. If the query only asks for a single metric, the scan times are really low (20ms-200ms) but if I query over six metrics, the segment scan times are 1.5 seconds.

Is this increase normal? Is the recommendation of hundreds of milliseconds pertaining to a single metric being queried? Should the segment scan time increase when multiple metrics are included in a query or rather stay roughly constant?

The scan time should increase roughly linearly with the number of metrics (and also with the number of dimensions if you’re using groupBy). So going from one metric at 20-200 ms to six metrics at around 1.5 seconds is in the expected ballpark.

Hello guys,

We have a little problem on our cluster.

We had 2 historical nodes and I decided to add a third in order to compare performance. But when I query this node (with a topN), sometimes the scan time (which is close to 0 in the normal case) is up to 900 ms… I can’t find out why.

Maybe because the load balancing is continuously running?

Any idea or help?

Thanks,

Ben

(see my other problem with load balancing -> https://groups.google.com/forum/#!topic/druid-user/gVaOKDmGDE4)

Benjamin, answering this question is extremely difficult without posting metrics related to the query.

Please see: http://druid.io/docs/0.9.1.1/operations/metrics.html