Rollup of multi-valued dimensions

Let's say I have 3 dimensions in my Druid datasource:

A, B, and C, where A and B are single-valued dimensions and C is a multi-valued dimension, and I have 3 rows of data like this:

a b c1,c2,c3

a b c2,c3,c1

a b c3,c1,c2

Assuming they all belong to the same time interval, will these 3 rows be rolled up into 1? They differ only in the order of the values in the multi-valued dimension.

Let's say we have 3 more rows:

a b c1

a b c1,c2

a b c1,c2,c3

Will any sort of aggregation/rollup happen across these rows?

Also, let's assume I have another datasource with 20+ dimensions, of which 10 are frequently queried together, while queries needing all 20 dimensions are less common. Will I get any significant query performance improvement if I split it into 2 datasources, one with the 10 frequently queried dimensions and another with all 20? This would reduce the number of rows in the first datasource but increase the total storage requirement of the Druid cluster.

Thanks

Rohit

Hi Rohit,

For your first question, the relative ordering of values within the multivalue dimension is always assumed to be the same, so the item at index 0 will always be “c1” (or whatever it may represent). If the ordering is not consistent, I believe you will get incorrect rollup.

For your second question, rollup will not happen. The length of the value array is taken into account when comparing rows for rollup, so the “grouping key” will be different for each row.

I’m not sure about your third, performance-related question.

  • Jon

In my case the multi-valued dimension can be thought of as “tags”: each input row can carry multiple tags, and the ordering among them has no meaning. Does that mean I should sort the values before ingestion?
If the ordering is not consistent, will I simply get no rollup, or can I get wrong aggregate values?

Just for my understanding, can somebody give me a use case where the relative ordering of values in a multi-valued dimension is useful?

As I understand it, they are all values of the same dimension, so order should not matter.

Also, can anybody help me understand the performance impact of splitting a datasource into multiple datasources based on which dimensions are frequently queried together?

Inline.

In my case the multi-valued dimension can be thought of as “tags”: each input row can carry multiple tags, and the ordering among them has no meaning. Does that mean I should sort the values before ingestion?
If the ordering is not consistent, will I simply get no rollup, or can I get wrong aggregate values?

Just for my understanding, can somebody give me a use case where the relative ordering of values in a multi-valued dimension is useful?

As I understand it, they are all values of the same dimension, so order should not matter.

It shouldn’t matter.

Also, can anybody help me understand the performance impact of splitting a datasource into multiple datasources based on which dimensions are frequently queried together?

If you divide your data into multiple datasources, you can control the rollup granularity (query granularity) and the data retention on a per-datasource basis. However, if your segments end up being too small, you hurt performance, as there is overhead associated with scanning each segment.

Hi Rohit,

Fangjin is correct; my initial answer to your first question was wrong. The ordering of values within the multivalue dimension doesn’t matter, since they will be sorted internally.

They must have the same length for rollup though.

  • Jon
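
To make the conclusion of this thread concrete, here is a rough Python sketch of the behaviour described above. It is a toy model only, not Druid's actual implementation: the `rollup` function and the row layout are made up for illustration, all rows are assumed to fall into the same time bucket, and the multi-valued dimension is sorted before it becomes part of the grouping key, as Jon describes.

```python
from collections import defaultdict


def rollup(rows):
    """Toy rollup: count input rows per grouping key.

    Each row is (A, B, C_values). The values of the multi-valued
    dimension C are sorted before being used in the key, which is an
    assumption based on the answers in this thread.
    """
    groups = defaultdict(int)
    for a, b, c_values in rows:
        key = (a, b, tuple(sorted(c_values)))
        groups[key] += 1
    return dict(groups)


# First example: same tags in a different order.
first = [
    ("a", "b", ["c1", "c2", "c3"]),
    ("a", "b", ["c2", "c3", "c1"]),
    ("a", "b", ["c3", "c1", "c2"]),
]

# Second example: value arrays of different lengths.
second = [
    ("a", "b", ["c1"]),
    ("a", "b", ["c1", "c2"]),
    ("a", "b", ["c1", "c2", "c3"]),
]

print(len(rollup(first)))   # number of rows after rollup for the first example
print(len(rollup(second)))  # number of rows after rollup for the second example
```

In this model the first three rows collapse into a single row because their sorted keys are identical, while the second three rows stay separate because their value arrays differ.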