How to order dimensions in indexSpec?

I have read somewhere that, starting with Druid 0.9.0, the indexer groups input rows on the tuple formed by the timestamp (truncated to the query granularity) and all the dimensions, in the order they are specified inside the dimensionsSpec. At least, that is how I understand it should work when the partitioning scheme is set to hashed, right?
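For illustration, here is a minimal sketch of my understanding of that grouping step (plain Python, not Druid code; the dimension names and the hour truncation standing in for queryGranularity are made up):

```python
from collections import defaultdict
from datetime import datetime

# Toy model of roll-up: rows are grouped on (truncated timestamp, dimension
# tuple) and the metric columns are aggregated. All values are made up.
rows = [
    ("2016-06-01T10:15", {"country": "DE", "device": "phone"}, 1),
    ("2016-06-01T10:47", {"country": "DE", "device": "phone"}, 1),
    ("2016-06-01T10:59", {"country": "US", "device": "tablet"}, 1),
]

dim_order = ["country", "device"]  # order as listed in the dimensionsSpec

rolled_up = defaultdict(int)
for ts, dims, count in rows:
    hour = datetime.fromisoformat(ts).replace(minute=0)   # queryGranularity: HOUR
    key = (hour,) + tuple(dims[d] for d in dim_order)
    rolled_up[key] += count                               # "count" metric aggregated

for key, count in rolled_up.items():
    print(key, count)   # two distinct tuples, so three input rows roll up into two
```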

I have also heard that it is advisable to order the dimensions by increasing cardinality, because this allows for better compression and also increases the roll-up ratio.
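For reference, this is the kind of ordering I mean, shown as a Python dict mirroring the JSON ingestion spec fragment (the dimension names and cardinalities are just examples):

```python
# Hypothetical dimensionsSpec fragment: dimensions listed from lowest to
# highest cardinality. Names and counts are illustrative only.
dimensions_spec = {
    "dimensionsSpec": {
        "dimensions": [
            "device_type",   # ~5 distinct values
            "country",       # ~200 distinct values
            "city",          # ~10,000 distinct values
            "customer_id",   # millions of distinct values
        ]
    }
}
```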

However, there isn’t much information available on best practices, so I’d like to ask a few questions:

1.) I experimented with ingesting the same data several times using different dimension orders and observed small differences in the resulting data volume, but the roll-up ratio was always the same. To my understanding, there is also no reason why the order in which the dimensions are specified should affect the roll-up ratio.

2.) To my understanding, specifying dimensions in order of increasing cardinality is only part of the story. Shouldn’t dimensions that are highly correlated also be close together? How can I determine the best order in which to specify dimensions? I was thinking about constructing a perplexity matrix and using it to find the best dimension order by keeping the perplexity as low as possible for as long as possible while adding more and more dimensions. Is that the right way to go?

3.) One of the dimensions we have is a customer id. We currently have it as one of the last dimensions because of its high cardinality, but I wonder if it would make more sense to put this dimension first. Would putting it first speed up Druid queries that filter on one customer id? I imagine that Druid would have fewer segments to scan if segments were first grouped on customer id.

4.) Does Druid ever skip scanning segments because it can determine upfront that a segment doesn’t contain the dimension range being filtered on?

I observe very different segment scan times depending on the filter condition I’ve set, but the number of segments scanned always seems to be in the same ballpark regardless of the filter I use.

Thanks,

Sascha

Hi Sascha,

I’ve included answers inline:

1.) I experimented with ingesting the same data several times using different dimension orders and observed small differences in the resulting data volume, but the roll-up ratio was always the same. To my understanding, there is also no reason why the order in which the dimensions are specified should affect the roll-up ratio.

Yes, you’re correct: the dimension ordering won’t affect the roll-up ratio, since the number of unique tuples stays the same no matter how the columns are ordered.
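A quick way to convince yourself of this (toy Python with made-up data): permuting the columns reorders the fields within each tuple, but it can never merge or split tuples, so the distinct count is invariant:

```python
from itertools import permutations

# Made-up rows of (hour, country, device). Reordering the dimensions just
# reorders the fields inside each tuple; the set of distinct tuples keeps
# the same size under every permutation.
rows = [
    ("10:00", "DE", "phone"),
    ("10:00", "DE", "phone"),
    ("10:00", "US", "tablet"),
    ("11:00", "DE", "phone"),
]

for perm in permutations(range(3)):
    reordered = {tuple(row[i] for i in perm) for row in rows}
    print(perm, len(reordered))   # always 3, so the roll-up ratio is identical
```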

2.) To my understanding, specifying dimensions in order of increasing cardinality is only part of the story. Shouldn’t dimensions that are highly correlated also be close together? How can I determine the best order in which to specify dimensions? I was thinking about constructing a perplexity matrix and using it to find the best dimension order by keeping the perplexity as low as possible for as long as possible while adding more and more dimensions. Is that the right way to go?

The main benefit of reordering dimensions is to increase the chances of getting long, well-compressible runs in the low-cardinality dimensions. Placing dimensions that are highly correlated with an early low-cardinality dimension next to it may therefore cut down the segment size of those correlated dimensions as well.

Your approach sounds pretty reasonable; I wouldn’t expect to see a large reduction in compressed size for the high-cardinality dimensions, though. Depending on your requirements, it may suffice to worry only about the ordering of the low-cardinality dimensions.
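If it helps, here is a rough sketch of the kind of analysis you describe (plain Python; the sample columns and the greedy strategy are my assumptions, not anything Druid provides). It starts with the lowest-cardinality column and then repeatedly picks the column that is cheapest to encode given the previously chosen one, using conditional entropy, i.e. log-perplexity, as the score. Note the simplification: it conditions only on the immediately preceding column rather than on the full prefix.

```python
import math
from collections import Counter

def conditional_entropy(xs, ys):
    """H(Y | X) in bits, estimated from two parallel value lists."""
    joint = Counter(zip(xs, ys))
    x_counts = Counter(xs)
    n = len(xs)
    return -sum(c / n * math.log2(c / x_counts[x]) for (x, y), c in joint.items())

# Made-up columns; in practice these would come from a sample of your input.
columns = {
    "device":  ["phone", "phone", "tablet", "phone", "tablet", "phone"],
    "country": ["DE", "DE", "US", "DE", "US", "DE"],
    "city":    ["Berlin", "Munich", "NYC", "Berlin", "Boston", "Munich"],
}

# Greedy ordering: seed with the lowest-cardinality column, then repeatedly
# add the column whose values are most predictable given the previous choice.
remaining = dict(columns)
name = min(remaining, key=lambda c: len(set(remaining[c])))
order = [name]
prev = remaining.pop(name)
while remaining:
    name = min(remaining, key=lambda c: conditional_entropy(prev, remaining[c]))
    order.append(name)
    prev = remaining.pop(name)

print(order)   # e.g. ['device', 'country', 'city']
```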

3.) One of the dimensions we have is a customer id. We currently have it as one of the last dimensions because of its high cardinality, but I wonder if it would make more sense to put this dimension first. Would putting it first speed up Druid queries that filter on one customer id? I imagine that Druid would have fewer segments to scan if segments were first grouped on customer id.

As of now, the number of segments scanned depends entirely on the interval you specify (see the answer to #4).

Within each segment, there may be a slight performance boost from having to decompress fewer blocks if the rows with a specific customer_id value tend to be clustered together, but I wouldn’t expect that to have a large impact on performance. Since the dimension has high cardinality, the bitmap index filtering will likely eliminate many rows from the segment, so the query won’t have to scan much per segment regardless of where the dimension sits in the ordering.
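As a simplified model of why the filtered dimension’s position matters little here (toy Python, not Druid internals, which use compressed bitmaps such as Roaring): the index maps each dimension value to the set of row offsets containing it, so a filter on one customer id resolves to a small row set before any column is scanned:

```python
from collections import defaultdict

# Toy per-segment bitmap index: dimension value -> set of row offsets.
# The customer ids below are made up.
customer_ids = ["c42", "c17", "c42", "c99", "c17", "c42"]

index = defaultdict(set)
for offset, value in enumerate(customer_ids):
    index[value].add(offset)

# Filtering on one customer id touches only the matching offsets, so only
# the blocks containing those rows need to be decompressed and scanned.
matching_rows = index["c42"]
print(sorted(matching_rows))   # [0, 2, 5]
```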

4.) Does Druid ever skip scanning segments because it can determine upfront that a segment doesn’t contain the dimension range being filtered on?

I observe very different segment scan times depending on the filter condition I’ve set, but the number of segments scanned always seems to be in the same ballpark regardless of the filter I use.

Skipping segments based on a query filter isn’t currently supported, but 0.9.2 may include such functionality:

https://github.com/druid-io/druid/pull/2982

Thanks,

Jon