GroupBy v2 tuning

Hi Everyone!

We are running the just-released Druid 0.9.2 and we are very happy with it so far.

We are also testing groupBy v2, because the release notes mention a 2-5x improvement, which would make a big difference for us. For client-side queries we use topN and timeseries of course, but this is for some background processes that rely on Druid’s groupBy and have been working pretty well so far.

Our historicals and brokers have 32GB RAM and 12 cores, and we assigned 31GB of JVM direct memory. For the first tests we added the following config on them:

druid.processing.buffer.sizeBytes=1073741824

druid.processing.numThreads=11

druid.processing.numMergeBuffers=2

To be precise, we set druid.processing.numMergeBuffers=2 somewhat blindly, because we couldn’t find any benchmarks for this option (it’s very new) and we didn’t know how it would affect the whole cluster.

After that we saw a modest improvement. One heavy groupBy that took ~7s now takes ~6s, and another that took ~5s now runs in ~3.5s, which is a good improvement but not 2x. One thing to note is that these groupBys are on low-cardinality dimensions.

What we would like to know is how we can tune this new groupBy engine from a configuration point of view. For example, what does a higher value of numMergeBuffers mean, and could it actually speed up a query? Are there any other config values we could try to modify?

Thanks!

Did you also pass “groupByStrategy”: “v2” in your query context? If not then the new engine is not actually being used.
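For reference, a minimal groupBy query with the v2 flag in its context would look something like this (the datasource, dimension, and interval here are just placeholders):

{
  "queryType": "groupBy",
  "dataSource": "my_datasource",
  "granularity": "all",
  "dimensions": ["my_dimension"],
  "aggregations": [{ "type": "count", "name": "rows" }],
  "intervals": ["2016-11-01/2016-12-01"],
  "context": { "groupByStrategy": "v2" }
}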

If it is being used then most of the tunings you can do are involving spilling. If the query isn’t spilling, then there aren’t that many useful knobs. If you haven’t set maxOnDiskStorage, then it’s definitely not spilling, since by default spilling is disabled.
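If you want to experiment with spilling, you can enable it cluster-wide in runtime.properties or per query in the context; for example (the 1GB value is just illustrative):

druid.query.groupBy.maxOnDiskStorage=1073741824

or, in the query context:

"context": { "groupByStrategy": "v2", "maxOnDiskStorage": 1073741824 }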

Raising numMergeBuffers just lets you run more queries concurrently, but it doesn’t affect the performance of any one specific query.

Raising buffer.sizeBytes can improve performance if you’re getting spilling, although if you aren’t then it probably won’t do much. If you only have 32GB RAM on your historicals, you may actually want to reduce your buffer.sizeBytes, since that will give you more space for disk cache. Whether that matters depends on how much data you have though.
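As a rough sizing check, going by the sizing guidance in the Druid docs, direct memory for processing and merge buffers is about buffer.sizeBytes * (numThreads + numMergeBuffers + 1). With your current settings that works out to roughly:

1073741824 bytes * (11 + 2 + 1) = ~14GB of direct memory for buffers

which leaves the rest of the 32GB for the heap and the OS page cache; shrinking buffer.sizeBytes shrinks that footprint accordingly.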

Hi Gian,

Yes, groupByStrategy was added to the query context as well. And I can tell the new engine is on because query performance is better; the improvement isn’t very large, but it is measurable.

I read a little bit about spilling and it seems interesting. I’m going to try allowing some disk storage for these groupBy queries to see how it goes, and also keep buffer.sizeBytes in mind. I find it a little hard to test performance on Druid because each query has its own tuning configs, and one change can make one query faster but others slower, so there has to be a balance between them according to our application needs.

Thanks for the tips and your time!