[druid-user] Re: I wonder why the query speed is drastically slower when cardinality is high when using group by

Highly recommend reading this blog post, inspired by Gian Merlino (PMC chair)

Is it cardinality of the dimension that you are grouping by? Or is it an aggregate function that you are performing on a high-cardinality column?

  • Peter

There is a similar discussion here right now:

Suggest we regroup on there?

Oh! That is a very interesting question!!!

I do not know, I am afraid.

Hmmmm…

I suppose the theory is that if you pull high-cardinality columns to the first dimension, then there may be less thrashing between the blocks inside a segment down on the data service… So that may speed up a little??? But I suspect the issue will still be the number of rows returned up the pyramid to the broker.

Perhaps I shall ask around for you in the teams here as well – I hope others may pitch in, too!

The purpose of sorting is to reduce the cross segment merge in aggregation. Sorting by the highest cardinality first would give the highest reduction in the merge load.

Vijay

Hi!

Two additional things you may want to check: take a look at the merge buffer configurations – especially to see if the merging is spilling onto the disk, which will slow things down. Also, are you using subqueries? Subqueries will cause Druid to execute several queries, with results merged at the broker – another possible source of issues.

  • Peter

I wanted to close the loop on this as I was not happy with my own answer…

The sorting in druid is within a segment file (not across segments). So there is no impact on merge due to sorting. Sorting speeds up aggregation by speeding up the read of the data. Data is stored in columnar format and only columns referenced in a query are read (simplistically a columnar storage transposes the rows and columns). Hence reading sorted rows will be like reading contiguous columns and is hence faster. Given this it is logical that sorting by highest cardinality column first will be beneficial. However the subsequent sort order after the first column is tricky as it depends a lot on the relative cardinalities of the columns with respect to each other (after the filter has filtered out data). Sorting by different columns (after sorting by the highest cardinality first) will have a different impact on different queries. The best approach is to try out a few combinations and see which works best for a majority of the queries.

vijay