I read this question about what makes a GroupBy query so slow, but I didn’t fully understand it.
I’d like to know the technical details of ‘what’ exactly makes a GroupBy query slower than a TopN query.
As far as I know, Druid uses bitmap indexes.
In that discussion, GroupBy creates an in-memory ephemeral data source. Is that because of the bitmap index?
I’d like to know more about the in-memory ephemeral data source and the temporary segment.
How close do the docs get to answering the question?
I note that the discussion you linked to is quite oooooold now… and there have been lots of optimisations since that was written, like a new GroupBy engine in 0.9.2 back in 2016, and more recently vectorisation in 0.19 last year.
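For intuition on the original question, here is a much-simplified conceptual sketch (plain Python, not Druid’s actual implementation) of why TopN can do less work than GroupBy: TopN aggregates each segment locally and keeps only the top K entries before merging those small partial results, while GroupBy has to materialize every distinct group before it can rank or limit anything. The segment data and K below are made up for illustration.

```python
from collections import Counter
from heapq import nlargest

# Toy "segments": lists of (dimension_value, metric) rows. Hypothetical data.
segments = [
    [("a", 10), ("b", 5), ("c", 1)],
    [("a", 3), ("c", 8), ("d", 2)],
]

def topn(segments, k):
    # Per-segment: aggregate locally, keep only the top k (small partial results).
    partials = []
    for seg in segments:
        agg = Counter()
        for dim, metric in seg:
            agg[dim] += metric
        partials.append(nlargest(k, agg.items(), key=lambda kv: kv[1]))
    # Merge only the small partials. This is why TopN can be approximate:
    # a value just below a segment's local top k is dropped before the merge.
    merged = Counter()
    for partial in partials:
        for dim, metric in partial:
            merged[dim] += metric
    return nlargest(k, merged.items(), key=lambda kv: kv[1])

def groupby_then_limit(segments, k):
    # GroupBy must keep *every* distinct group in the map before ranking,
    # so memory and merge work grow with the number of distinct groups.
    agg = Counter()
    for seg in segments:
        for dim, metric in seg:
            agg[dim] += metric
    return nlargest(k, agg.items(), key=lambda kv: kv[1])

print(groupby_then_limit(segments, 2))  # exact: [('a', 13), ('c', 9)]
print(topn(segments, 2))                # 'c' undercounted to 8: segment 1 dropped it early
```

Note how the TopN path touches far less state at merge time, at the cost of possible inaccuracy in the tail, which matches why Druid recommends GroupBy when you need exact results over many groups.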
I already read the document you mentioned, but I still don’t understand.
The document says the v2 strategy engine of GroupBy uses off-heap maps. Why use maps?
Is the reason to build a tuple of the dictionary ids of each dimension for each row, and then use the map to aggregate per tuple? Is that right?
If I group by multiple high-cardinality columns, does the number of tuples created make the GroupBy query slower?