What makes GroupBy so slow?

So I hear a lot about how TopN and TimeSeries are way faster than GroupBy.

However, there are no technical details on what exactly makes that happen.

The most I have heard is that TopNs use ‘approximation’ (that word by itself is not enough detail).

GroupBys are much easier to program against, and a lot of the time a single GroupBy does the job of multiple TopN/TimeSeries queries.

Without any details about what makes GroupBys slower, it is currently almost impossible to weigh easier programming against a faster UI.

Can someone please provide some technical details on the internals of GroupBys?

A simple useful example might be to show how a GroupBy might go about processing data vs a TopN.

Great question.

Roughly speaking:

groupBys make an in-memory, ephemeral data source.

topN creates a bounded priority queue of results.

topNs are our most common query at MMX, so there have been a lot of optimization passes on them.

groupBy is something I haven’t tinkered with much, and it hasn’t really had a performance-optimization pass as far as I know.

But roughly speaking, topNs are faster because the intermediate data type is faster, and topNs have had more optimization passes.
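To make the two intermediate data types concrete, here is a toy Python sketch (my own illustration, not Druid’s actual code): the topN-style path only keeps a bounded priority queue of the best N results, while the groupBy-style path materializes every group in a temporary structure before returning it.

```python
import heapq
from collections import defaultdict

rows = [
    ("us", 3), ("de", 5), ("us", 2), ("fr", 1), ("de", 4), ("jp", 7),
]

def top_n(rows, n=2):
    """topN-style: aggregate, then keep only a bounded heap of N results."""
    totals = defaultdict(int)          # per-dimension running sums
    for dim, metric in rows:
        totals[dim] += metric
    # bounded priority queue: only the N largest entries survive
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])

def group_by(rows):
    """groupBy-style (conceptually): materialize *every* group in a
    temporary structure, then read all of the results back out of it."""
    temp_source = defaultdict(int)     # stand-in for the ephemeral data source
    for dim, metric in rows:
        temp_source[dim] += metric
    return sorted(temp_source.items())

print(top_n(rows))     # [('de', 9), ('jp', 7)]
print(group_by(rows))  # every group, not just the top N
```

The key difference this sketch is meant to show: the topN result set is bounded by N regardless of cardinality, while the groupBy result set grows with the number of distinct groups.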

How does the ephemeral data source impact performance?
Is it a giant buffer where all the data is copied and then queried?

more details are still needed :slight_smile:

You can think of it like a temporary segment (groupBys actually use a lot of the same code that segment-creation uses).

Do I have this correct:

GroupBys are slow because they do a copy of the data to a temporary buffer.

Thus, the speed penalty is due to the extra copying involved.


Mostly. It’s not just that it is “some” temporary buffer, it is because the temporary buffer is a fully functional and queryable (and malleable) segment.

In other words, it copies it in such a way that focuses on versatility rather than performance.
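Here is a hypothetical sketch of that “versatility rather than performance” trade-off (all names and structures are my own invention, not Druid internals): a segment-like copy keeps dictionary-encoded columns and an inverted index so it can be queried flexibly, whereas a plain aggregation only keeps one running total per group.

```python
def build_ephemeral_segment(rows):
    """Versatile: keep a dictionary-encoded dimension column, a metric
    column, and an inverted index, so the copy is queryable like a segment."""
    dim_dict, dim_column, metric_column, inverted = {}, [], [], {}
    for row_id, (dim, metric) in enumerate(rows):
        code = dim_dict.setdefault(dim, len(dim_dict))   # dictionary encoding
        dim_column.append(code)
        metric_column.append(metric)
        inverted.setdefault(code, []).append(row_id)     # index: code -> rows
    return dim_dict, dim_column, metric_column, inverted

def plain_aggregate(rows):
    """Fast: one running total per group, nothing else retained."""
    totals = {}
    for dim, metric in rows:
        totals[dim] = totals.get(dim, 0) + metric
    return totals

rows = [("us", 3), ("de", 5), ("us", 2)]
print(build_ephemeral_segment(rows))
print(plain_aggregate(rows))
```

The per-row work in the first function (encoding, appending to several structures, maintaining an index) is the kind of extra copying cost being described, paid in exchange for a copy you can query and modify.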

So far this is what I have understood about GroupBys:

  1. They copy data into a buffer.

  2. The buffer looks like a segment, which makes it even slower than a ‘regular’ buffer.

Since I (like most other Druid users) don’t know the internals of Druid or Druid segments, the second point is hard for anyone to understand.

So the only thing I got is that ‘data is copied into a buffer’.

Considering that, GroupBys should be faster if you have to, say, scan for 20 different dimensions, since the overhead of scanning the same rows again and again with TopN, plus the network overhead, would seem to be much larger than doing a single GroupBy.
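To make that reasoning concrete, here is a back-of-envelope cost model (all constants are invented for illustration, not measured Druid numbers):

```python
def group_by_cost(rows, per_row_copy_factor):
    # one scan, but every row pays extra to be copied into the temp segment
    return rows * per_row_copy_factor

def topn_cost(rows, dimensions, per_query_overhead):
    # one cheap scan per dimension, plus a fixed per-query (network) overhead
    return dimensions * (rows + per_query_overhead)

rows, dims = 1_000_000, 20
print(group_by_cost(rows, per_row_copy_factor=5))        # 5,000,000 "units"
print(topn_cost(rows, dims, per_query_overhead=10_000))  # 20,200,000 "units"
```

With these made-up constants the single GroupBy wins, but the crossover point depends entirely on how large the copy factor really is relative to the number of repeated TopN scans.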

Am I missing something here?

More details would be highly appreciated. Please assume that your readers don’t know the internal data structures of Druid (e.g., segments).