Druid performance tuning questions

Hi,

My Druid setup is currently slow on queries that have no filter conditions set, where Druid needs to aggregate a metric over all rows. Queries that filter on a dimension with cardinality > 50 run very quickly, so filtering itself works fine on large datasets.

Executing a single topN query over four metrics on 20 billion rows takes at least 12-15 seconds and goes up to 70 seconds under light load, on a cluster of 27 r3.8xlarge nodes with all data memory-mapped.
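
To give an idea of the shape, one of these topN queries looks roughly like the following (the datasource, dimension, and metric names are placeholders rather than our real schema):

    {
      "queryType": "topN",
      "dataSource": "events",
      "intervals": ["2016-01-01/2016-07-01"],
      "granularity": "all",
      "dimension": "country",
      "metric": "clicks",
      "threshold": 10,
      "aggregations": [
        { "type": "longSum", "name": "clicks", "fieldName": "clicks" },
        { "type": "longSum", "name": "impressions", "fieldName": "impressions" },
        { "type": "doubleSum", "name": "revenue", "fieldName": "revenue" },
        { "type": "doubleSum", "name": "cost", "fieldName": "cost" }
      ]
    }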

I have a few questions:

  • In the production setup guide (http://druid.io/docs/latest/configuration/production-cluster.html) it says “Historical daemons should have a heap size of at least 1GB per core for normal usage”, but further down, the historical config for an r3.8xlarge node is given only a 12GB heap, even though that instance type has 32 cores.

  • The production setup guide recommends druid.broker.http.numConnections=20 for the broker, whereas the benchmark settings here (https://github.com/druid-io/druid-benchmark/blob/master/config/broker-runtime.properties) set it to 300 (I have put the two values side by side in the snippet after this list).
    Why are these values so different? What are the implications of setting it so low or so high?

  • The benchmark blog (http://druid.io/blog/2014/03/17/benchmarking-druid.html) mentions “Interval chunking is turned on by default to prevent long interval queries from taking up all compute resources at once.”

  • When I turn chunking on, even if only a few chunks (2-10) are created, the broker query response times are twice as slow as without chunking. Furthermore, there is an open ticket reporting that chunking produces incorrect numbers (https://github.com/druid-io/druid/issues/2262).
    Is it safe to use chunking in Druid 0.9.1.1?

  • The documentation on chunkPeriod (http://druid.io/docs/latest/querying/query-context.html) states “Make sure “druid.processing.numThreads” is configured appropriately on the broker”. What would an appropriately configured setting be when using chunkPeriod versus not using it?

  • I have read several Druid blogs in which people talk about tunings like -XX:+UseNUMA, switching off biased locking (-XX:-UseBiasedLocking), zone_reclaim_mode, etc. We use standard r3.8xlarge nodes on Ubuntu Trusty. So far I have not been able to observe any performance improvement while playing around with such settings. Do they apply to r3.8xlarge instances at all? Are there any OS-level tunings recommended for r3.8xlarge nodes?
    Any tips on these subjects, or any other suggestions, are highly appreciated. I have tested pretty much every setting Druid has, but have not seen any performance improvement so far.
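
For clarity, here is the single broker property the two documents disagree on, with both recommended values shown side by side (the rest of the runtime.properties is omitted):

    # production setup guide
    druid.broker.http.numConnections=20

    # druid-benchmark config
    druid.broker.http.numConnections=300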

thanks

Sascha

One more question:
Is it possible to configure query chunking on the server side, or only within the query context?

Hey Sascha,

Appropriate historical heap sizing can vary a lot. You generally want it to be as small as possible so you leave as much room as possible for the page cache. The biggest things to think about are how much heap you need for local cache (if any), query-time lookups (if any), and result merging (more for groupBy, less for topN, even less for timeseries). The 1GB/core recommendation in the docs is conservative – the idea is that if you follow it, you probably won’t get OOMEs, but you also might not have the absolute best-tuned cluster for a given workload. The 12GB heap size in the example configs is more “balanced” but can yield OOMEs on some groupBy- or QTL-heavy workloads.

I’m not sure if there are people using chunkPeriod in production. It can’t be set globally; it has to be set on the query itself, in the query context.
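
For example, a six-month query could be broken into month-long chunks by adding something like this to the query’s JSON (the period value is just illustrative):

    "context": {
      "chunkPeriod": "P1M"
    }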

The other tunings you listed sound like things Eric recommended in a couple other threads. I believe he’s running Red Hat Linux on physical hardware. Your mileage may vary with other OSes and with virtual machines.

While you’re tuning things, try running timeseries queries too, with the same metrics and filter as your topN. Timeseries queries are much simpler and give you a good best-case bound on how fast you can scan a dataset with a particular configuration.
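
For example, a timeseries query over the same kind of four-metric aggregation might look something like this (datasource and metric names are placeholders):

    {
      "queryType": "timeseries",
      "dataSource": "events",
      "intervals": ["2016-01-01/2016-07-01"],
      "granularity": "all",
      "aggregations": [
        { "type": "longSum", "name": "clicks", "fieldName": "clicks" },
        { "type": "longSum", "name": "impressions", "fieldName": "impressions" },
        { "type": "doubleSum", "name": "revenue", "fieldName": "revenue" },
        { "type": "doubleSum", "name": "cost", "fieldName": "cost" }
      ]
    }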