[druid-user] Multi-value dimension performance compared to Datasource with 30,000 dimension

I actually have some experience on this related to audience segmentation. I’m assuming that any subset of the 30k values might be present in an event, and the data looks something like this:


and you want to do things like “find the average of the last column for events with e in the multi-value column”.

One of the key considerations you need to look at for something like this large of a value set is how the topN behaves in a list-value dimension (aka multi value dimension). Due to how the topN works, rows are included and excluded FIRST and aggregated in a topN SECOND. That means that if you do a topN on a dimension with up to 30k multi-values, it is entirely possible that the value you filtered on does not actually appear in the topN result! For example, if you filter to include some uncommon value, but all the matching rows have MANY of some common values, then your topN is going to be all the common values since the filter happened first, selected the matching events, and the aggregation happened second, aggregating top topN of the dimension values. This means the following scenario happens:



If you ask “how many events had the label a?” the answer is 2. If you ask “how many events had the label b?” the answer is 1. If you ask “What are the top labels by count of events they are in” you would get 2 and 1 for a and b respectively in a topN… but the total of those together (2+1=3) is more than the number of events (2). It is just the nature of the question you are asking when you have multi-value columns.

From the other extreme is putting all the dimension values in their own dimension. In this scenario your topN is also not useful because you have no values to compare to each other.

IFF you are only filtering and counting, and never comparing two dimension values to each other, then just doing a big multi-value dimension should be pretty easy. Doing each its own dimension makes data discovery pretty terrible.

But if you are wanting to compare things to each other, then figuring out a way to get things that are related into the same dimensions will help your sanity a lot. For example, in the thing above, if a and b are related because they represent similar concepts (like maybe a country), and c, d, e, and f, are related because they represent some other concept (like lets say, favorite food). Then a dimension for “country” and a dimension for “favoriteFood” would make sense. Maybe someone has both e and f in their favorites, that’s ok, at least when you do queries they are easier to interpret the results, even if you have the topN oddities called out above. If a taxonomy exists for the labels, then splitting dimensions in a way that relates to the taxonomy will probably make your lives easier. Especially if there are any subsets that are mutually exclusive.

Hopefully this helps,

Charles Allen