For a data source that I am setting up, I’m focusing on two main priorities:
- Make it fast to filter queries by a State
- Make it fast to count distinct Household IDs
According to this tutorial vid, I can get speedy reads while filtering by State if I make State the first column in the dimensions list. This would sort the rows in the segments by State, optimizing filtering efficiency.
Also, according to the docs on the DistinctCount Aggregator, in order to not overcount while using distinctCount, I need to “use a single dimension hash-based partition spec to partition data by a single dimension”, i.e. by household_id.
So if I do both, how would the partitioning affect the dimension ordering? Would I lose the speedy filtering by State, since I’ve partitioned the data by a different dimension? Or is Druid still able to keep state sorted while partitioning on household_id, giving us the benefits of both techniques?
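
For concreteness, here is a rough sketch of the combination I mean, written as Python that just builds and prints the relevant spec fragments. It assumes native batch ingestion (index_parallel) with a hashed partitionsSpec; the column names state and household_id are from my data, and everything else about the spec is left out, so treat it as an illustration of the question rather than a working ingestion spec.

```python
import json

# Assumption: native batch ingestion; only the two fragments relevant
# to the question are shown, the rest of the spec is omitted.

# State listed first so rows within each segment sort by state.
dimensions_spec = {"dimensions": ["state", "household_id"]}

# Hash-partition on household_id so every row for a given household
# lands in the same segment (what the distinctCount docs ask for).
partitions_spec = {
    "type": "hashed",
    "partitionDimensions": ["household_id"],
}

spec_fragment = {
    "dataSchema": {"dimensionsSpec": dimensions_spec},
    "tuningConfig": {
        "type": "index_parallel",
        # Assumption: needed for hashed partitioning in a parallel task.
        "forceGuaranteedRollup": True,
        "partitionsSpec": partitions_spec,
    },
}

print(json.dumps(spec_fragment, indent=2))
```

The question is whether the first fragment still gives sorted-by-state segments once the second one decides which segment each row goes to.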