Why is group by query slow?

I read this question about what makes GroupBy query so slow.
However, I didn’t get it all.

I want to know technical details on ‘what’ exactly makes GroupBy query slow than TopN.

As far as I know, Druid uses bitmap index.
In that article, groupby make an in-memory ephemeral data source, is it because of the bitmap index?
I’d like to know more about in-memory ephemeral data source, temporary segment.

@Gian_Merlino2 Help Me :blush:

How close do the docs get in answering the question?

I note that the discussion you linked to is quite oooooold now… and there have been lots of optimisations since that was written - like a new GROUP BY engine in 0.9.2 in 2016, and more recently vectorisation in 0.19 last year,

1 Like

hi peter!

I already read the document you told me, but I don’t understand.

The document says v2 strategy engine of groupby uses off-heap maps, why use maps?
The reason is for build a tuple of id of each of the dimension for each row?
Then, we use the map for aggregation. Is it right?

If I groupby with multiple high cardinality columns, does the number of tuples created make groupby query slower?

You have several issues when you GROUP BY a high cardinality column.

Yes, there’s the map, but also you are returning many many rows up the query pyramid from each of the data processes (historical / middle manager) in the fan in stage up to the broker.

Imagine this is a normal GROUP BY:

Hist 1 = 5m rows == GROUP BY ==> 50 rows =\
Hist 2 = 5m rows == GROUP BY ==> 30 rows =|
Hist 3 = 5m rows == GROUP BY ==> 25 rows ==> Broker Merge ==> 30 rows
Hist 4 = 5m rows == GROUP BY ==> 10 rows =|
Hist 5 = 5m rows == GROUP BY ==> 30 rows =/

When Druid GROUP BY on a high cardinality column, the broker is very much busier:

Hist 1 = 5m rows == GROUP BY ==> 3m rows =\
Hist 2 = 5m rows == GROUP BY ==> 3m rows =|
Hist 3 = 5m rows == GROUP BY ==> 2m rows ==> Broker Merge ==> 4.5m rows
Hist 4 = 5m rows == GROUP BY ==> 5m rows =|
Hist 5 = 5m rows == GROUP BY ==> 4m rows =/

You should read Data Modeling and Query Performance in Apache Druid for sure.

OH and there’s also this awesome video from @Gian_Merlino2 at Virtual Druid Summit…