groupBy latency increases a lot as the number of output rows increases


I’m having a groupBy performance issue and I’m not quite sure whether it’s the nature of Druid or whether I didn’t configure it properly. As the number of output rows increases, groupBy query latency increases a lot. In one experiment, the data (after ingestion) was around 300M; a 2-dimension groupBy with 20 rows of output took less than 0.1s, but a 2-dimension groupBy with 1M rows of output took 50s to finish. I did all the configuration according to the Druid Production Cluster Configuration docs. Is this expected behavior for Druid, or should I change the configuration?
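For reference, the 2-dimension groupBy in question looks roughly like the sketch below. The datasource, interval, and field names ("my_datasource", "D_1", "D_2", "M_1") are placeholders, not values from the actual cluster.

```python
import json

# Minimal sketch of a 2-dimension Druid groupBy query (placeholder names).
query = {
    "queryType": "groupBy",
    "dataSource": "my_datasource",
    "granularity": "all",
    "dimensions": ["D_1", "D_2"],
    "aggregations": [
        {"type": "doubleSum", "name": "sum_M_1", "fieldName": "M_1"}
    ],
    "intervals": ["2015-01-01/2015-02-01"],
}

# Latency grows with the number of output rows because every distinct
# (D_1, D_2) pair becomes a result row that must be built, merged across
# nodes, and serialized; 1M groups is far more work than 20 groups.
print(json.dumps(query, indent=2))
```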

Our current use case is trying to efficiently extract a fact table from a large Hive dataset with a schema like the following.

D_1, D_2, D_3, … D_k, M_1, M_2, M_3, …, M_l

where D_1, …, D_k are the dimensions we are interested in slicing and dicing, and M_1, M_2, …, M_l are the metrics we will aggregate on.

Any suggestions for performance improvement?


I forgot to mention that the schema

D_1, D_2, D_3, … D_k, M_1, M_2, M_3, …, M_l

is for the fact table that we want to extract from the larger table(s) in Hive.
The number of rows will be significantly less than in the Hive tables, but it’s still going to be big.


Hi Qi,

We would like to understand your use case a bit more before diving into groupBy queries. Are you trying to use Druid to power interactive, user-facing data visualizations? When you return 1M rows as the result, how do you plan to visualize those 1M results in a consumable way for the user? The reason I ask is that I often see a workflow where Druid users try to return large result sets so they can be loaded into a tool such as Excel or R and then visualized. Druid is designed such that the result set should be much smaller than the input set, and you can query Druid directly to create visualizations.

Druid’s query language is fairly low level, and constructing visualizations is often much better done with numerous timeseries or topN queries than with a single groupBy query. This is the whole idea behind the facet.js library. The Metamarkets product, for example, does not use groupBy queries at all, and instead uses a composition of timeseries and topN queries to construct visualizations.

In your particular use case, can you arrive at the same answer with 2 topN queries (find all top values of Dim_1, and for all those values, find all top values of Dim_2) instead of a groupBy over the 2 dimensions? The latter will be much slower than the former for high-cardinality dimensions.
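The two-topN decomposition suggested above can be sketched as follows: one topN over D_1, then one follow-up topN over D_2 per top value of D_1, filtered to that value. The datasource, threshold, interval, and field names are placeholders.

```python
import json

# First topN: find the top values of D_1 by some aggregated metric.
topn_dim1 = {
    "queryType": "topN",
    "dataSource": "my_datasource",
    "dimension": "D_1",
    "metric": "sum_M_1",
    "threshold": 100,
    "granularity": "all",
    "aggregations": [
        {"type": "doubleSum", "name": "sum_M_1", "fieldName": "M_1"}
    ],
    "intervals": ["2015-01-01/2015-02-01"],
}

def topn_dim2_for(value):
    # Second topN, run once per top value of D_1: same aggregation, but
    # over D_2 and filtered to a single D_1 value.
    q = dict(topn_dim1)
    q["dimension"] = "D_2"
    q["filter"] = {"type": "selector", "dimension": "D_1", "value": value}
    return q

print(json.dumps(topn_dim2_for("some_value"), indent=2))
```

Each topN returns at most `threshold` rows, so the result sets stay small no matter how many distinct (D_1, D_2) pairs exist in the data.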

We can talk about making groupBys faster, but in my general experience, a groupBy is rarely what you actually want; it’s just the simplest way to arrive at an answer.

I got you. That helps a lot. Thanks!

Hi Fangjin,

We learned a new use case and I probably need your advice.

The use case is a nested groupBy. The total number of rows in the final query result is guaranteed to be small. However, the inner groupBy query will generate an intermediate table with a huge number of rows. I’ll give an example here.

Suppose we have a schema like the following.

D_1, D_2, D_3, …, D_k, M_1, M_2, …, M_l

where the D_i are the dimensions and the M_i are the metrics.

The nested groupby works like this.

Step 1: compute the sum of M_1 (call it AggM_1), grouped by D_1, D_2, D_3:

D_1, D_2, D_3, SUM(M_1) AS AggM_1

Step 2: from the table produced in step 1, compute the sum of squares of AggM_1, grouped by D_1, D_2:

D_1, D_2, SUM(AggM_1^2)

The cardinalities of D_1 and D_2 are very small, so the final result size is guaranteed to be small. However, the cardinality of D_3 is large, so the intermediate table is very large.
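The two steps above can be sketched as a Druid groupBy with a "query" datasource, where the outer groupBy consumes the inner one’s result. All names are placeholders, and the SUM(AggM_1^2) step is shown only as an aggregator stub; actually squaring the values would need something like a javascript aggregator or post-processing.

```python
import json

# Inner query: one row per distinct (D_1, D_2, D_3) — this is the large
# intermediate table.
inner = {
    "queryType": "groupBy",
    "dataSource": "my_datasource",
    "granularity": "all",
    "dimensions": ["D_1", "D_2", "D_3"],
    "aggregations": [
        {"type": "doubleSum", "name": "AggM_1", "fieldName": "M_1"}
    ],
    "intervals": ["2015-01-01/2015-02-01"],
}

# Outer query: re-aggregates the inner result over the low-cardinality
# dimensions, so the final result is small.
outer = {
    "queryType": "groupBy",
    "dataSource": {"type": "query", "query": inner},
    "granularity": "all",
    "dimensions": ["D_1", "D_2"],
    # Stub: sums AggM_1 rather than its square; squaring is where a
    # javascript aggregator or similar would come in.
    "aggregations": [
        {"type": "doubleSum", "name": "sum_sq_AggM_1", "fieldName": "AggM_1"}
    ],
    "intervals": ["2015-01-01/2015-02-01"],
}

print(json.dumps(outer, indent=2))
```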

Queries like this are slow. I guess it’s probably because the inner query produces too many rows in its result? Does that have anything to do with memory cost, which could be reduced by allocating more memory to the historical node in its configuration file? Is there any other way to improve this?

We cannot pre-compute all the intermediate tables since there are too many of them.



Hi Qi,

I think the way that query will run (assuming you did it as a groupBy with a subquery) is that the historical nodes will compute their share of the inner query and send that to the broker; if D_3 is very high cardinality, this could be a large result set. The broker will materialize the inner table in its heap, then run the outer query locally. To speed up this kind of query, the first thing to figure out is which part is slow: is most of the time spent on the historicals or on the broker? You should be able to figure that out by looking at the query/time and request/time metrics on the historicals and on your broker. You should also take a look at CPU use and time spent in GC.

Whichever one is the bottleneck, I’d guess it’s related to the inner query. It’s probably either not being parallelized enough on the historicals (in which case you could try using smaller segments), or taking too long to merge on the broker (in which case you could try enabling historical-side merging, if you haven’t already), or the inner-table materialization is using too much memory and getting you stuck in GC land (in which case you could try allocating more heap to whichever service needs it).
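A quick way to compare the metrics mentioned above is to collect the emitted query/time values per service and compare averages. The sketch below assumes you have the metrics as JSON objects, one per line, each with "metric", "value", and "service" fields; the exact shape depends on which metrics emitter you have configured, and the sample values are made up.

```python
import json

# Made-up sample metric lines for illustration.
sample_lines = [
    '{"metric": "query/time", "value": 1200, "service": "historical"}',
    '{"metric": "query/time", "value": 48000, "service": "broker"}',
    '{"metric": "query/time", "value": 900, "service": "historical"}',
]

def avg_query_time(lines, service):
    # Average query/time (ms) across all entries for one service.
    values = [
        e["value"]
        for e in map(json.loads, lines)
        if e.get("metric") == "query/time" and e.get("service") == service
    ]
    return sum(values) / len(values) if values else 0.0

# If the broker's time dwarfs the historicals', broker-side merging (or
# the inner-table materialization) is the likely bottleneck.
print(avg_query_time(sample_lines, "historical"))  # 1050.0
print(avg_query_time(sample_lines, "broker"))      # 48000.0
```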

Hi Gian,

Thanks for the help.

Can you explain more about the historical-side merge? I can’t find it in the docs. I also tried to grep the source code, but I didn’t find it.


It seems the bottleneck for now may be the transmission of data from the historical nodes to the broker, or the merge process on the broker. If it’s a transmission issue, is there any way to compress the partial results before transmission?


Hi Qi, the results should be gzipped before they are sent over the wire. If you turn on debug-level logging on the broker side, you should be able to see when a request goes out and when the first response comes back. I suspect the problem is merging on the broker side, but it would be good to confirm.

Hi Qi,

Historical merging is on by default. It would be off if you enabled broker caching. The broker caching setting is “” on the broker node. It’s set to false by default.