Features with High Cardinality as both dimension and datasketch metric?

Beginner here. We’ve just implemented a Druid cluster on a bunch of AWS EC2’s.

I am wondering what’s the best practice when I have a column with High Cardinality (e.g. User ID) that I want to use as a grouping variable and at the same time as a metric? A use case for this would be to get an average of a certain metric per User ID.

Would it make sense if I declare this column BOTH as a dimension and a datasketch metric (or any count distinct type)?

What would be the drawback of adding a dimension with high cardinality? Will it complicate the metric precomputation?

I am not so sure if Druid really does precomputation for all the combination of dimensions in the background (like OLAP cubes), would like to know if this is really the case.


If you are using rollup, Druid's ingestion will aggregate up to the
dimensions specified in the spec. I.e. if you have a count aggregator
and a sum aggregator, it would effectively do a big group by at
ingestion time like

SELECT time_granularity(timestamp), dim1, dim2, dim3, count(*) AS
count_name, sum(field) AS sum_name GROUP BY dim1, dim2, dim3

If you include userId as a dimension, then it would be included in
that group by and would increase the number of rows required to store
the data being ingested. Depending on the analysis, this might be a
requirement or might not.

Specifically, when you say you want to "get an average of a certain
metric per User ID" are you meaning that you want an average of
averages, or you are wanting a global average? I.e. is it

sum_of_per_user_average / number_of_users

Or is it

sum_of_metric / number_of_users

If the latter, a sketch is fine. If the former, then you would need
to keep the userId around to be able to compute the
sum_of_per_user_average. Note, however, that computing the
sum_of_per_user_average would require an intermediate result set with
a cardinality of userId, which could be very large and resultantly
impact the runtime of that query.

Thanks for the helpful tips Eric!

I’ve decided to create two tables based from one source. One table has the UserID dimension while the other has a sketch metric run against UserID.

I am wondering how could I manage the volume of the data needed to be stored when it comes to the table with the UserID dimension. I am thinking if I should roll it up on a per week or month to lessen the volume of data.

One downside to including a high cardinality dimension is that it makes rollup/summarization less efficient. Being more aggressive with your time rollup can help, but you will never get fewer records than the cardinality of the userId column itself.