I recently ran a round of performance tests on our cluster with an eye toward figuring out which knobs we could turn to improve performance. If we choose to spend more money on hardware to get better performance, we want to know where to spend it and how much bang we'd get for the buck.
This was a pretty frustrating and mostly fruitless effort. The cluster performance (measured as query latency) had very high variability, which more or less swamped the effects of any changes I made to configs and/or host classes. I'm wondering if anyone has advice on other things to try.
First of all, we are running Druid 0.9.1.1 on a cluster of 2 brokers, 2 historicals, and 3 zookeeper/coordinator/overlord nodes. The cluster is in AWS/EC2, and our baseline config uses m4.xlarges for historicals, m4.larges for brokers. We have a 100 MB local cache on the brokers. (I can post additional config info or even the full files if needed. Likewise I can post the actual queries. Info about our datasource is below.)
My test procedure involved a script that ran through about 15 different queries in sequence (not in parallel; there might have been occasional light load on the cluster outside of the testing, but probably not much), for 20 to 25 iterations. It was also running through our service, which executes the queries and does some transformation on the responses, so the numbers are not an absolute measure of Druid's performance, but they ought to be roughly comparable across runs with different configs. The queries are all GroupBys, and a significant chunk of them use query time lookups (QTL) for one of the GroupBy dimensions. We are still using the old, deprecated static QTL configuration where a TSV is loaded from S3 (we haven't had a chance yet to update to the new lookup mechanism).
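For reference, the harness looks roughly like the sketch below. This is a simplified stand-in, not our actual script: the endpoint and names are placeholders, and in reality we go through our service layer rather than hitting anything directly.

```python
import json
import time
import urllib.request

# Hypothetical harness: run each query sequentially for N iterations and
# record wall-clock latency. Because the real script goes through our service,
# the measured time includes some response transformation, not just Druid.
SERVICE_URL = "http://our-service.internal/query"  # placeholder
ITERATIONS = 25

def run_query(query):
    """POST one query and return the elapsed wall-clock time in seconds."""
    body = json.dumps(query).encode("utf-8")
    req = urllib.request.Request(
        SERVICE_URL, data=body, headers={"Content-Type": "application/json"})
    start = time.time()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.time() - start

def run_suite(queries):
    """Run all ~15 queries in sequence (never in parallel), ITERATIONS times."""
    timings = {name: [] for name in queries}
    for _ in range(ITERATIONS):
        for name, query in queries.items():
            timings[name].append(run_query(query))
    return timings
```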
Our worst-case query, "locationName", groups across three dimensions including a QTL dimension; it is basically the worst case for our product. Over a one-month range it covers just over 2 million rows and produces 31,435 result buckets. The cardinality of the QTL dimension for this query is 13,663 - again, this is our absolute worst case. On our baseline cluster it can take a maximum of about 45 seconds, with an average (over 20-25 tries) of about 22 seconds. (Again, this round trip time includes some service work, not pure query time.) That is the time we would love to understand how to improve. Similar queries that use a regular dimension instead of a QTL generally take under a second; other QTL-based queries seem to get slower as the cardinality of the lookup dimension goes up. Our QTL TSV has ~153K rows.
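For concreteness, the worst-case query is shaped roughly like this (written as the Python dict our service would build; the datasource, dimension, and lookup names are placeholders, the aggregator is simplified, and the extractionFn is shown in the newer registeredLookup style rather than our actual deprecated namespace-based config):

```python
# Approximate shape of the "locationName" GroupBy: three dimensions,
# one of them mapped through a query time lookup.
worst_case_query = {
    "queryType": "groupBy",
    "dataSource": "our_datasource",            # placeholder
    "granularity": "all",
    "intervals": ["2016-09-01/2016-10-01"],    # one-month range
    "dimensions": [
        "dimA",                                # placeholder regular dimension
        "dimB",                                # placeholder regular dimension
        {
            "type": "extraction",
            "dimension": "locationId",         # placeholder
            "outputName": "locationName",
            # Our real config still uses the old static/namespace QTL setup;
            # this registeredLookup form is just illustrative.
            "extractionFn": {"type": "registeredLookup",
                             "lookup": "location_names"},
        },
    ],
    "aggregations": [{"type": "count", "name": "rows"}],  # simplified
}
```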
So. I tried increasing heap, direct memory, and processing buffer sizes on both the historicals and the brokers, all the way up to r3.8xlarge instances with the recommended production configs. There was some effect, but it was fairly marginal and not really reproducible. For example, r3.8xlarge brought the average on the big query down to 16.5 seconds, though the max time was not affected. But there is so much variability that it's hard to know how much things are really changing. For example, the first change I made was to increase the historical heap from 2G -> 3G on the original m4.xlarges. In my first run this appeared to have a dramatic effect: average time went from 22 -> 12.7 seconds, max from 45 -> 26.5. But I was suspicious, because I had also seen improved performance after decreasing the heap. So I reran the 3G heap tests - exact same configuration - a few days later. In those runs the performance dropped back to my original baseline or even slightly worse.
I ran only one test where I added a historical node (still on m4.xlarge but with slightly increased heap and direct memory over baseline). This did not seem to have any impact at all.
Maybe this inconsistency is just what you get from AWS EC2 instances. And admittedly I probably did not run enough iterations of the test script. Running hundreds or thousands of iterations might help even out the blips in hardware performance, but that is really time consuming. Anyway, I am left unsure which tweaks would be most effective for improving performance. Any suggestions?
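One thing I may try next is simply running many more iterations and comparing summary statistics instead of eyeballing averages. A minimal sketch of the kind of summary I have in mind (plain Python, nothing Druid-specific):

```python
import statistics

def summarize(samples):
    """Summarize one query's latencies (in seconds) across iterations."""
    samples = sorted(samples)
    n = len(samples)
    return {
        "n": n,
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "p95": samples[max(0, int(0.95 * n) - 1)],
        "max": samples[-1],
        "stdev": statistics.stdev(samples) if n > 1 else 0.0,
    }

# Rough rule of thumb: if the difference between two configs' means is small
# relative to stdev / sqrt(n), run-to-run noise can easily explain it.
```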
Our data source: due to the nature of our data and some of our use cases, our schema results in pretty much zero roll-up at ingestion. Nearly every row has a unique ID. (We need the unique IDs to support a different, Select-query-based use case, but we are considering adding a separate data source that is a closer fit for the GroupBy use case and would allow some degree of roll-up during ingestion - this is definitely conceivable, but how much difference would it make?) Our segment granularity is one day. The data spans 596 intervals and totals 87.5 GB, though it was slightly smaller when I ran the tests, roughly 72 GB. Segment sizes range from around 75 MB to 300 MB, with most pretty close to 200 MB. We have 27 dimensions and 9 metrics.
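To get a feel for that last question before building a second datasource, my plan is to estimate the achievable roll-up offline on a sample of raw data. A rough sketch, assuming pandas, placeholder column names, and hourly query granularity (none of these are our real values):

```python
import pandas as pd

# Hypothetical estimate of roll-up if we dropped the unique ID and kept only
# the dimensions the GroupBy use case needs. All names below are placeholders.
df = pd.read_csv("sample_day.tsv", sep="\t")

GROUPBY_DIMS = ["locationId", "dimA", "dimB"]                # assumed dims to keep
df["hour"] = pd.to_datetime(df["timestamp"]).dt.floor("H")   # queryGranularity = HOUR

rolled_up_rows = df.groupby(["hour"] + GROUPBY_DIMS).ngroups
rollup_ratio = len(df) / rolled_up_rows
print(f"{len(df)} raw rows -> {rolled_up_rows} rolled-up rows "
      f"(~{rollup_ratio:.1f}x reduction)")
```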