How does partitioning interact with dimension ordering?

For a data source that I am setting up, I’m focusing on two main priorities:

  1. Make it fast to filter queries by a State
  2. Make it fast to count distinct Household IDs

According to this tutorial vid, I can get speedy reads while filtering by State if I make State the first column in the dimensions list. This would sort the rows in the segments by State, optimizing filtering efficiency.

Also, according to the docs on the DistinctCount Aggregator, in order not to overcount while using distinctCount I need to “use a single dimension hash-based partition spec to partition data by a single dimension”, i.e. by household_id.

So if I do both, how would the partitioning affect the dimension ordering? Would I lose the speedy filtering by State, since I’ve partitioned the data by a different dimension? Or is Druid still able to keep State sorted while partitioning on household_id, giving us the benefits of both techniques?

I’m not familiar with the distinct count aggregator, but the docs recommend APPROX_COUNT_DISTINCT_DS_HLL, which should be wired up correctly to the COUNT(DISTINCT expr) expression in SQL. See the SQL page of the Apache Druid docs.
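As a rough sketch of what that could look like, here is a request body you might POST to the Druid SQL endpoint (/druid/v2/sql). The datasource name "households" and the filter value 'CA' are made up for illustration, the column names are taken from your question, and I'm assuming the druid-datasketches extension is loaded, since the DS_HLL function depends on it.

```json
{
  "query": "SELECT APPROX_COUNT_DISTINCT_DS_HLL(\"household_id\") AS household_count FROM \"households\" WHERE \"state\" = 'CA'"
}
```

The result is an estimate rather than an exact count, so check that the error is acceptable for your use case.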

I’m not aware of any restrictions on accuracy with these aggregators, so you should be able to use range partitioning on any dimensions and still get accurate count distincts on non-partitioned dimensions.
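For example, a native batch ingestion spec along these lines would list state first in the dimensions and also range-partition on it. This is only a fragment: the timestampSpec, granularitySpec and ioConfig are left out, the datasource name "households" is hypothetical, and the targetRowsPerSegment value is a placeholder, not a recommendation.

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "households",
      "dimensionsSpec": {
        "dimensions": ["state", "household_id"]
      }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "forceGuaranteedRollup": true,
      "partitionsSpec": {
        "type": "range",
        "partitionDimensions": ["state"],
        "targetRowsPerSegment": 5000000
      }
    }
  }
}
```

The dimension order and the partitionsSpec are separate settings: as I understand it, the partitionsSpec decides which segment a row lands in, while the dimension order controls how rows are sorted inside each segment (after the timestamp).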

Hope this helps