Hey, I’ve got some questions about how Druid processes multi-value dimensions. Druid version is 0.8.2. See details below.
Adding a multi-value dimension (cardinality: 300+) doubles the size of a segment. Generally, the indices of a single-value dimension are much sparser than those of a multi-value dimension. However, I think the root cause could be that the indices of multi-value and single-value dimensions are compressed differently. Based on the latest doc here, multi-value dimensions are never compressed, but I’ve found support for multi-value compression mentioned in the release notes of Druid 0.7.3.
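To show what I mean by sparsity, here is a toy Python sketch. It is not Druid code: the sample rows and the `build_bitmaps` and `total_set_bits` helpers are made up for illustration. It keeps one set of row ids per dimension value, roughly in the spirit of a bitmap index, and counts how many bits would be set in total.

```python
from collections import defaultdict

def build_bitmaps(rows):
    """Map each dimension value to the set of row ids containing it (hypothetical helper)."""
    bitmaps = defaultdict(set)
    for row_id, values in enumerate(rows):
        for value in values:
            bitmaps[value].add(row_id)
    return dict(bitmaps)

def total_set_bits(bitmaps):
    """Total number of set bits across all per-value bitmaps."""
    return sum(len(row_ids) for row_ids in bitmaps.values())

# Made-up sample data: every row holds a list of values for one dimension.
single_value_rows = [["a"], ["b"], ["c"], ["a"]]
multi_value_rows = [["a", "b"], ["b", "c"], ["a", "b", "c"], ["a"]]

single_bits = total_set_bits(build_bitmaps(single_value_rows))
multi_bits = total_set_bits(build_bitmaps(multi_value_rows))
print(single_bits, multi_bits)
```

With a single-value dimension each row sets exactly one bit, so the total equals the row count. With a multi-value dimension each row sets one bit per value it holds, so the bitmaps are denser for the same number of rows.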
Grouping by a multi-value dimension is slower than grouping by a single-value dimension. The response time of a “group by” query on a multi-value dimension (cardinality: 300+) is much higher than on a single-value dimension (cardinality: 4000+). Since the total RAM of all historical nodes is sufficient to hold all segments (so the response time should not be affected much by index size), I wonder whether there is any difference in how “group by” queries run on single-value versus multi-value dimensions.
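Here is a toy sketch of the difference I suspect. This is an assumption on my side, not confirmed Druid internals: if a row with several values is counted once under each of its values, then the grouping work grows with the number of values per row rather than with the number of rows. The `group_by_count` helper and the sample rows are hypothetical.

```python
from collections import Counter

def group_by_count(rows):
    """Count rows per dimension value; a row contributes once per value it holds."""
    counts = Counter()
    contributions = 0  # number of (row, value) pairs the aggregation has to touch
    for values in rows:
        for value in values:
            counts[value] += 1
            contributions += 1
    return counts, contributions

# Made-up sample data, same number of rows in both cases.
single_value_rows = [["a"], ["b"], ["c"], ["a"]]
multi_value_rows = [["a", "b"], ["b", "c"], ["a", "b", "c"], ["a"]]

single_counts, single_work = group_by_count(single_value_rows)
multi_counts, multi_work = group_by_count(multi_value_rows)
print(single_work, multi_work)
```

In this toy model the multi-value case touches twice as many (row, value) pairs for the same four rows, even though it has no more distinct values.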
Before I dig deeper into the multi-value data and the query computation time, I have the following questions:
Are single-value and multi-value dimensions compressed differently? If yes, why, and how does that affect the segment size?
Is there any difference in how a “group by” query is processed for single-value and multi-value dimensions? If yes, what is the difference?
I could not find many docs on multi-value dimensions on druid.io. Are there any other posts/QAs/docs that could be helpful?
Thanks a lot!