help with performance tuning

I recently ran a round of performance tests on our cluster with an eye toward figuring out which knobs we could turn to improve performance. If we choose to spend more money on hardware to get better performance, we want to know where to spend it and how much bang we’d get for the buck.

This was a pretty frustrating and mostly fruitless effort. The cluster performance (measured as query latency) had very high variability, which more or less swamped the effects of any changes I made to configs and/or host classes. I’m wondering if anyone has advice on other things to try.

First of all, we are running Druid 0.9.1.1 on a cluster of 2 brokers, 2 historicals, and 3 zookeeper/coordinator/overlord nodes. The cluster is in AWS/EC2, and our baseline config uses m4.xlarges for historicals, m4.larges for brokers. We have a 100 MB local cache on the brokers. (I can post additional config info or even the full files if needed. Likewise I can post the actual queries. Info about our datasource is below.)
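For reference, the cache-related part of our broker config looks roughly like this (paraphrased from memory, not copied verbatim from our runtime.properties):

```
# broker runtime.properties (approximate)
druid.broker.cache.useCache=true
druid.broker.cache.populateCache=true
druid.cache.type=local
# ~100 MB local cache
druid.cache.sizeInBytes=104857600
```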

My test procedure involved a script that ran through about 15 different queries in sequence (not in parallel - there might have been occasional light load on the cluster outside of the testing, but probably not much), for 20 to 25 iterations. It was also running through our service, which executes the queries and does some transformation on the responses, so the numbers are not an absolute measure of Druid’s performance, but they ought to be roughly comparable across runs with different configs. The queries are all GroupBys, and a significant chunk of them use query time lookups (QTL) for one of the GroupBy dimensions. We are still using the old, deprecated static QTL configuration where a TSV is loaded from S3 (we haven’t had a chance yet to update to the new stuff).

Our worst-case query, “locationName”, groups across three dimensions including a QTL dimension - it is basically the worst case for our product. Over a month-long range it covers just over 2 million rows and produces 31,435 buckets. The cardinality of the QTL dimension for this query is 13,663 - again, this is our absolute worst case. On our baseline cluster this can take a maximum of about 45 seconds, with an average (over 20-25 tries) of about 22 seconds. (Again, this round-trip time includes some service work, not pure query time.) That is the time we would love to understand how to improve. Similar queries that use a regular dimension instead of a QTL generally take under a second; other QTL-based queries seem to increase in latency with cardinality. Our QTL TSV has ~153K rows.
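To make that concrete, the shape of the query is roughly the following - the dimension names, the aggregation, and the lookup extraction spec are illustrative placeholders rather than our exact spec:

```
{
  "queryType": "groupBy",
  "dataSource": "our_datasource",
  "granularity": "all",
  "intervals": ["2016-08-01/2016-09-01"],
  "dimensions": [
    "dimA",
    "dimB",
    {
      "type": "extraction",
      "dimension": "locationId",
      "outputName": "locationName",
      "extractionFn": {
        "type": "lookup",
        "lookup": { "type": "namespace", "namespace": "locations" },
        "retainMissingValue": true
      }
    }
  ],
  "aggregations": [
    { "type": "longSum", "name": "count", "fieldName": "count" }
  ]
}
```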

So. I tried increasing heap, direct memory, and processing buffer size on both the historicals and the brokers, all the way up to r3.8xlarges with the recommended production configs. There was some effect, but it was fairly marginal and not really reproducible. For example, r3.8xlarge brought the average on the big query down to 16.5 seconds, though the max time was not affected. But there is so much variability that it’s hard to know how much things are really changing. For example, the first change I made was to increase the historical heap size from 2G -> 3G on the original m4.xlarges. In my first run this appeared to have a dramatic effect: average time went from 22 -> 12.7 seconds, max from 45 -> 26.5. But I was suspicious, because I had also seen increased performance after decreasing the heap. So I reran the 3G heap tests - exact same configuration - a few days later. In those tests the performance dropped back to my original baseline level, or even slightly worse.
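For reference, the knobs I was actually varying between runs were along these lines (the numbers here are just one of the configurations I tried on the m4.xlarge historicals, not a recommendation):

```
# historical jvm.config (one flag per line)
-Xmx3g
-XX:MaxDirectMemorySize=4g

# historical runtime.properties
druid.processing.buffer.sizeBytes=536870912
druid.processing.numThreads=3
```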

I ran only one test where I added a historical node (still on m4.xlarge but with slightly increased heap and direct memory over baseline). This did not seem to have any impact at all.

Maybe this inconsistency is just what you get from AWS EC2 instances. And admittedly, I probably did not run enough iterations of the test script. Running hundreds or thousands of iterations might help even out the blips in hardware performance, but that is really time-consuming. Anyway, I’m left unsure which tweaks would be most effective for improving performance. Any suggestions?

Our data source: due to the nature of our data and some of our use cases, our schema results in pretty much zero ingestion roll-up - nearly every row has a unique ID. (We need the unique IDs to support a different, Select-query-based use case, but we are considering making a separate data source that is a closer fit for the GroupBy use case and would allow some degree of roll-up during ingestion - this is definitely conceivable, but how much difference would it make?) Our segment granularity is one day. The data spans 596 intervals and 87.5 GB, though it was slightly smaller when I ran the tests, roughly 72 GB. Segment sizes range from around 75 MB to 300 MB, and most are pretty close to 200 MB. We have 27 dimensions and 9 metrics.
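If we do build that GroupBy-oriented datasource, I think the ingestion-spec change is basically just excluding the unique ID dimension and coarsening queryGranularity so rows can actually collapse - something like the following fragment (field names illustrative):

```
"dimensionsSpec": {
  "dimensionExclusions": ["uniqueId"]
},
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": "HOUR"
}
```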

Hi Ryan, performance tuning of any large-scale distributed system requires an understanding of the system itself and where the bottlenecks lie. The easiest way to see where your bottlenecks are is to understand and report the Druid metrics around query performance:
http://druid.io/docs/0.9.1.1/operations/metrics.html
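To get those flowing, something along these lines in common.runtime.properties is usually enough to start with (logging emitter shown for simplicity; there is also an http emitter), and query/time and query/node/time are the first metrics to look at:

```
druid.emitter=logging
druid.emitter.logging.logLevel=info
druid.monitoring.monitors=["com.metamx.metrics.JvmMonitor"]
```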

There are also companies such as Imply that provide dedicated services and products for Druid performance tuning.

If you’re doing performance tuning, getting a good handle on the metrics coming out of Druid is a must. There are multiple things that can screw around with performance, including AWS noisy neighbors, network driver versions, EBS vs. instance store, filesystem tuning, ext4 vs. xfs vs. btrfs vs. others, Java GC tuning… the list goes on for quite a while.

Also note that there are a lot of improvements to groupBy in the works that are supposed to make it a lot faster (just FYI).

As a community we typically strive to make the defaults “pretty good for most use cases”.

Thanks… wish I had known about those metrics before running through my tests. My own fault for not reading the docs more thoroughly… How about adding a note to this effect to the Performance FAQ? Maybe at the top under “I Can’t Match Your Benchmarked Results”. I certainly did look at that FAQ before doing my testing.

I’ll rerun some tests and capture the metrics at some point in the future, so you may hear from me again - but with more useful data to chew on. :)

If you are benchmarking with groupBy and are comfortable building from source, try building from master and using the new groupBy engine. In benchmarks we’ve done it can offer a 2x–5x improvement. Docs are here: https://github.com/druid-io/druid/blob/master/docs/content/querying/groupbyquery.md#implementation-details.

The most important things to do when testing it out are:

  • Set druid.processing.numMergeBuffers on your broker and historical nodes to the number of concurrent groupBy queries you need to be able to run (and make sure you have enough direct memory available for whatever you set it to).

  • Set “groupByStrategy” : “v2” in your query context to enable the new engine for a given query.
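Putting those together, it looks roughly like this (the numMergeBuffers value is just a placeholder - size it for your own concurrency):

```
# broker and historical runtime.properties
druid.processing.numMergeBuffers=2
```

and in the query itself:

```
"context": { "groupByStrategy": "v2" }
```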

Wow! I have not yet built from source, but for that kind of potential improvement I would be willing to dive in and do it. :)

I’m not sure when I’ll be able to get to it - right now we are experimenting with refactoring our datasource in a way that allows us to reduce the number of dimensions we have to group by, but even then we’ll still be grouping by one dimension in most cases. So we’d still be interested in the new engine. Thanks for the info!

“druid.processing.numMergeBuffers” - is that a new property only for the new engine? I don’t recognize it.

druid.processing.numMergeBuffers is currently only used by the new groupBy engine, although it’s possible other things might use it in the future. It controls the number of buffers in an off-heap pool used for merging results; groupBy v2 merges off-heap, whereas most other query types merge their results on-heap and don’t use this pool.
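The main thing to watch with it is direct memory sizing - as a rough rule of thumb (illustrative numbers):

```
# direct memory needs to cover processing buffers, merge buffers, and one extra buffer
MaxDirectMemorySize >= druid.processing.buffer.sizeBytes
                       * (druid.processing.numThreads + druid.processing.numMergeBuffers + 1)

# e.g. 512 MB buffers, 7 processing threads, 2 merge buffers -> at least 5 GB of direct memory
```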