How does Druid compress multi-value dimensions and run "group by" queries?

Hey, I’ve got some questions about how Druid processes multi-value dimensions. The Druid version is 0.8.2. See details below.

Observations:

  1. Adding a multi-value dimension doubles the size of a segment (cardinality: 300+). Generally, the indices of a single-value dimension are much sparser than those of a multi-value dimension, but I think the root cause could be that multi-value and single-value indices are compressed differently. The latest doc here says multi-value dimensions are never compressed, yet I found support for multi-value compression in the release notes of Druid 0.7.3.

  2. Grouping by a multi-value dimension is slower than by a single-value dimension. Running "group by" on the multi-value dimension (cardinality: 300+) takes much longer than on a single-value dimension (cardinality: 4000+). Since the total RAM of all historical nodes is sufficient to host all segments (so response time should not be affected much by the size of the indices), I wonder whether there is any difference in how "group by" queries run against single-value versus multi-value dimensions.
    Before I take a deeper look at the multi-value data and the query timings, I’ve got the following questions:

  3. Are single-value and multi-value dimensions compressed differently? If so, why, and how does that affect segment size?

  4. Is there any difference in how "group by" queries are processed between single-value and multi-value dimensions? If so, what is it?

  5. I could not find many docs on multi-value dimensions on druid.io. Are there any other posts/Q&As/docs that could be helpful?

Thanks a lot!

Hi there!

You and I are kindred spirits. Here I am, implementing first-class multi-value dimension support in Plywood (https://github.com/implydata/plywood/tree/work_with_set) and running tests on Druid, when your question pops up.

Let me drop some knowz, I hope it will be helpful in some way.

Imagine you have this dataset:

{ "name": "Honda CR-V", "tags": ["SUV", "AWD"], "options": ["green", "automatic"] }

{ "name": "Honda Accord", "tags": ["hatchback", "FWD"], "options": ["blue", "manual"] }

*This is actually my partial car history. Not included is the Moskvitch 2141 Svyatogor, which was my first car.

  1. Every value in a multi-value dimension gets its own index, so there will be 4 entries in the options index and 4 in the tags index. MV dims can therefore ‘explode’ your indexes in size, but they make filters like "SUV IN tags" fast.

  2. When you query these values you get a set explosion - I call it the hyper-crossproduct-expansion™:

plyql --host 192.168.99.100:8082 --introspection-strategy datasource-get -i P1W -q "SELECT options, tags, COUNT(*) as Count FROM cars GROUP BY options, tags"

[
  { "options": "automatic", "Count": 1, "tags": "AWD" },
  { "options": "automatic", "Count": 1, "tags": "SUV" },
  { "options": "blue", "Count": 1, "tags": "hatchback" },
  { "options": "blue", "Count": 1, "tags": "FWD" },
  { "options": "green", "Count": 1, "tags": "AWD" },
  { "options": "green", "Count": 1, "tags": "SUV" },
  { "options": "manual", "Count": 1, "tags": "FWD" },
  { "options": "manual", "Count": 1, "tags": "hatchback" }
]

So the results are double (or quadruple) counted.

This means that (depending on your dimensions) it can really explode the group by and make it slow, which is why you would definitely want to go for the much faster approximate topNs.
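
If you want to see the fan-out outside of Druid, here is a tiny Python sketch of it. This is just a mental model of the behavior above, not anything Druid actually runs: each row contributes one group per combination of its multi-values, which is where the extra groups (and the extra work) come from.

from itertools import product
from collections import Counter

# The two example rows, each with two multi-value dimensions.
rows = [
    {"tags": ["SUV", "AWD"], "options": ["green", "automatic"]},
    {"tags": ["hatchback", "FWD"], "options": ["blue", "manual"]},
]

# GROUP BY options, tags: each row fans out into one group per
# (options value, tags value) combination, so 2 x 2 = 4 groups per row.
counts = Counter()
for row in rows:
    for options_value, tags_value in product(row["options"], row["tags"]):
        counts[(options_value, tags_value)] += 1

for (options_value, tags_value), count in sorted(counts.items()):
    print({"options": options_value, "tags": tags_value, "Count": count})

Running this prints the same eight groups as the PlyQL result above.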

FWIW this treatment of MV dimensions is actually really useful because it is so powerful. This is why I am mimicking it in Plywood. If you do not need this enormous power, there are plenty of ways you can ingest your data differently to trade power for speed.

Also, if you want to make use of MV dimensions in Plywood or PlyQL, I would wait until the branch I linked to is merged (and deleted).

Vadim

Whoops, I forgot to post an example of what a single group by looks like:

plyql --host 192.168.99.100:8082 --introspection-strategy datasource-get -i P1W -q "SELECT options FROM cars GROUP BY options"

[
  { "options": "automatic" },
  { "options": "blue" },
  { "options": "green" },
  { "options": "manual" }
]

Hi Vadim,

Thanks for your reply. Supporting sets in Plywood is cool. I certainly benefit a lot from Druid's multi-value dimensions.

My questions are more about Druid internals.

  1. For both single-value and multi-value dimensions, Druid builds one index for each unique value of the dimension. See the doc here: http://druid.io/docs/latest/design/segments.html. In my opinion, the only difference is that the indices of a single-value dimension are much sparser (only one value per row). So I think compression plays a very important role in segment size.

  2. I think a group by query may check every built index. I was wondering whether Druid has some optimizations for single-value dimensions. For example, for single-value dimensions the group by query runner could skip rows that already have a value when checking a certain index.

At the storage level the multi valued format is a pretty straight forward extension of the single valued format. In a single valued column we store one dictionary encoded value (an int) per row, plus one compressed bitmap per value. In a multi valued column we store potentially many dictionary encoded values per row, and also store one compressed bitmap for each individual value. There is also an extra level of indirection to enable random access. And yeah, compression is supported for both of them now, we’ll fix the docs. Thanks for the catch.
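
To make that layout concrete, here is a toy Python model of the two column formats. It is only meant to show the structure: a value dictionary, dictionary-encoded row values, and one bitmap per distinct value. Real segments store compressed ints and compressed bitmaps, not Python lists, so treat this strictly as an illustration.

# Toy model of a dimension column: dictionary, encoded row values, per-value bitmaps.
# This ignores compression entirely; it only mirrors the structure described above.

def build_single_value_column(rows):
    # rows: one string per row, e.g. ["Honda CR-V", "Honda Accord"]
    dictionary = sorted(set(rows))
    encode = {value: i for i, value in enumerate(dictionary)}
    encoded = [encode[value] for value in rows]             # one int per row
    bitmaps = {value: [int(value == row) for row in rows]   # one bitmap per value
               for value in dictionary}
    return dictionary, encoded, bitmaps

def build_multi_value_column(rows):
    # rows: a list of strings per row, e.g. [["SUV", "AWD"], ["hatchback", "FWD"]]
    dictionary = sorted({value for row in rows for value in row})
    encode = {value: i for i, value in enumerate(dictionary)}
    encoded = [[encode[value] for value in row] for row in rows]  # list of ints per row
    bitmaps = {value: [int(value in row) for row in rows]         # still one bitmap per value
               for value in dictionary}
    return dictionary, encoded, bitmaps

print(build_single_value_column(["Honda CR-V", "Honda Accord"]))
print(build_multi_value_column([["SUV", "AWD"], ["hatchback", "FWD"]]))

The multi valued version has to store a variable-length list of ints per row instead of a single int, which is where the extra level of indirection comes from.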

Multi valued columns tend to be bigger for three reasons, 1) the indirection, 2) there are usually more values stored per row, and 3) each bitmap is usually somewhat denser and moderate density bitmaps do not compress as well as very sparse or very dense bitmaps.

Queries on multi valued columns tend to be slower. I’m not aware of any optimizations that the multi valued grouping queries are missing out on, I think they’re slower because there is just plain more work to do. There’s more data to scan and there are more groups in the intermediate result sets (since each value gets its own group).

If I am missing something then hopefully someone else will correct me :slight_smile:

Hi Gian,

The segments design doc (http://druid.io/docs/latest/design/segments.html) says that for every dimension there are three data structures: a dictionary that encodes column values, the column data, and bitmaps. I thought Druid only stored the value dictionary and bitmaps for a dimension. Why does Druid store the per-row values for each dimension? Is it for random access or something else?

I just read the docs about Druid queries and filters. I don't see why random access is necessary in Druid. Is it required by other Druid functionality? Are there any docs related to it?

As for group by queries on a dimension: from the client side, Druid returns an aggregation for each possible value. So I think Druid works in the following steps: 1. get all values of the dimension; 2. for each value, get all targeted rows; 3. do the aggregation. For sure the Druid query walker is much more complicated, but the overall process of a group by query on single-value and multi-value dimensions looks the same.

Follow-ups would be of great help.

Random access is useful when you are filtering on one column and then grouping on values from another column. The query process at a high level is 1) resolve filters into a single bitmap using the indexes, 2) use that bitmap to create a cursor, 3) query engines use the cursor to read from any columns involved in grouping or aggregation.
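
In toy Python terms, and again only as a sketch of that flow rather than the actual engine, a filtered group by over the example car data looks roughly like this (reusing the per-value bitmaps and per-row values from the earlier illustration):

from collections import Counter

# Toy segment: per-value bitmaps for filtering, plus per-row values for the
# dimension we group on. The real engine works on compressed, dictionary-encoded
# data, but the three steps are the same.
tags_bitmaps = {
    "SUV":       [1, 0],
    "AWD":       [1, 0],
    "hatchback": [0, 1],
    "FWD":       [0, 1],
}
options_rows = [["green", "automatic"], ["blue", "manual"]]

# 1) Resolve the filter ("SUV" IN tags) into a single bitmap using the index.
filter_bitmap = tags_bitmaps["SUV"]

# 2) Use that bitmap as a cursor over the matching row offsets.
matching_offsets = [i for i, bit in enumerate(filter_bitmap) if bit]

# 3) The query engine uses the cursor for random access into the grouping
#    column and aggregates; each value of a multi-value row gets its own group.
counts = Counter()
for offset in matching_offsets:
    for value in options_rows[offset]:
        counts[value] += 1

print(dict(counts))  # {'green': 1, 'automatic': 1}

Step 3 is why the per-row column data is needed in addition to the bitmaps: once the filter has picked the row offsets, the engine has to jump straight to those rows in the other columns.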

Thank you. It's very helpful.