I have read somewhere that, starting with Druid 0.9.0, the indexer groups input rows on the tuple formed by the timestamp (truncated to the queryGranularity) and all the dimensions in the order they are specified in the dimensionsSpec. At least this is how it should work when the partitioning scheme is set to hashed, right?
I have also heard that it is advisable to order the dimensions by increasing cardinality, because this allows for better compression and also increases the roll-up ratio.
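For reference, this is the kind of ordering I mean in the dimensionsSpec (the dimension names here are made up for illustration, listed from lowest to highest assumed cardinality):

```json
{
  "dimensionsSpec": {
    "dimensions": [
      "country",
      "device",
      "customer_id"
    ]
  }
}
```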
However, there isn’t much information available on best practices here, so I’d like to ask a few questions:
1.) I experimented with ingesting the same data several times using different dimension orders and observed small differences in the resulting data volume, but the roll-up ratio was always the same. To my understanding there is also no reason why the order in which the dimensions are specified should affect the roll-up ratio, since rows are grouped on the same set of dimension values regardless of their order; only the sorting, and therefore the compression, should change.
2.) To my understanding, specifying dimensions in order of increasing cardinality is only part of the story. Shouldn’t dimensions that are highly correlated also be close together? How can I determine the best order in which to specify dimensions? I was thinking about constructing a perplexity matrix and using it to find the best dimension order by keeping the perplexity as low as possible for as long as possible while adding more and more dimensions. Is that the right way to go?
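Concretely, a greedy version of what I have in mind would look something like this (the dimension names and data are made up; this is just a sketch of the idea, not something I have validated against Druid):

```python
def greedy_dimension_order(rows, dimensions):
    """Greedy sketch: at each step, append the dimension that keeps the
    number of distinct prefix tuples as small as possible, so low-cardinality
    and highly correlated dimensions tend to end up in front.

    rows: list of dicts mapping dimension name -> value
    dimensions: list of candidate dimension names
    """
    order = []
    remaining = list(dimensions)
    while remaining:
        best_dim, best_count = None, None
        for dim in remaining:
            candidate = order + [dim]
            # Number of distinct value tuples for this candidate prefix.
            distinct = len({tuple(r[d] for d in candidate) for r in rows})
            if best_count is None or distinct < best_count:
                best_dim, best_count = dim, distinct
        order.append(best_dim)
        remaining.remove(best_dim)
    return order


# Synthetic rows: device has 2 values, country 3, customer_id 50.
rows = [
    {
        "country": ["US", "DE", "FR"][i % 3],
        "device": ["mobile", "desktop"][i % 2],
        "customer_id": i % 50,
    }
    for i in range(300)
]

print(greedy_dimension_order(rows, ["customer_id", "device", "country"]))
# → ['device', 'country', 'customer_id']
```

The greedy choice is of course only a heuristic; it would not catch cases where a dimension looks expensive on its own but is almost fully determined by a combination of later dimensions.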
3.) One of our dimensions is a customer id. We currently have it as one of the last dimensions because of its high cardinality, but I wonder if it would make more sense to put it first. Would putting it first speed up Druid queries that filter on a single customer id? I imagine that Druid would have fewer segments to scan if segments were first grouped on customer id.
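Put differently, I’m wondering whether something like single_dim partitioning on the customer id is the right tool for this kind of grouping. A sketch of the partitionsSpec I have in mind (the target row count is an arbitrary assumption on my part):

```json
{
  "partitionsSpec": {
    "type": "single_dim",
    "partitionDimension": "customer_id",
    "targetRowsPerSegment": 5000000
  }
}
```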
4.) Does Druid ever skip scanning a segment because it can determine upfront that the segment doesn’t contain the dimension range being filtered on?
I observe very different segment scan times depending on the filter condition I set, but the number of segments scanned always seems to be in the same ballpark regardless of the filter I use.