Compression seems low

It seems like the amount of compression I’m getting with Druid is less than I would expect. I can take the records I’m storing in Druid and get 8-12x compression using default gzip compression, but with Druid I only get 2-3x (HDFS then takes it and gets another 3x).

Is there some setting I’m missing? Or does Druid just not compress well? It seems like it should do a good job because it is a column-store, which usually makes compression better…

Any thoughts?

Thanks!

Ron Hay

Advanced Developer

Arbor Networks

Thanks for taking an interest! Compression is something we have been taking a lot of baby steps (and big steps) on in the last year.

Currently the dimension values are not compressed, meaning if you have very large dimension values with high cardinality (ex: full URLs with query params) then you won’t get very good compression.

What Druid version are you using? Since compression has changed notably “recently”, your results are going to vary depending on which version you are running.

If you REALLY want to see what is taking up the space, you can look at the meta.smoosh file in the index.zip blobs from your deep storage (we do this on occasion to see if we can better optimize some of our data stores).

It will look something like this:

```
v1,2147483647,1
__time,0,415956928,416288124
some_dim_1,0,0,40666
some_dim_2,0,40666,81415
some_dim_3,0,81415,184262
some_dim_4,0,184262,227321
```

The first row has to do with versioning, but for the other rows:

The first column is the column name, and the third and fourth columns are the start and end byte offsets of that column’s data (the byte bounds per “thingy you can ask about in a query”).

The second column is an index specifying which #####.smoosh file to look in (usually it’s just 0, for 00000.smoosh).
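
If you want to script that inspection rather than eyeball the file, here is a rough sketch in plain Python; it only assumes the four-column layout shown above and a local copy of meta.smoosh:

```python
import csv

# Rough sketch: report how many bytes each column occupies in the smoosh
# files, based on the meta.smoosh layout shown above
# (name, smoosh file number, start offset, end offset).
with open("meta.smoosh") as f:
    rows = list(csv.reader(f))

sizes = {}
for name, _file_num, start, end in rows[1:]:   # skip the version row
    sizes[name] = int(end) - int(start)

for name, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {size / (1024 * 1024):.2f} MiB")
```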

There is currently no way to specify methods to maximize compression of the compressed indices (for example, ordering dimensions within a particular query granularity to maximize run lengths and thereby shrink the bitmaps), but such an optimization would probably require query-path changes as well, and it is not immediately clear that it would result in a net speed improvement.

Thanks for the response - it looks like we’re moving ahead with a more advanced prototype to test with some of our customers, so I’m starting to dig deeper into Druid. (I wrote a proof of concept over the summer and had solid success.) Compression is a big issue, as our customers would potentially be storing huge amounts of data.

I’m using 0.7.1.1 currently. Would I see significant improvements with 0.8.1?

It seems like compression would almost always help with queries, given that the biggest bottleneck is getting the data off the disk, especially in TopN and other aggregating queries that have poor cache performance because the data is significantly bigger than memory.

Ron

Ron,

We implemented significant compression changes in 0.7.3 and 0.8.0, so you should give 0.8.1 a try.

It should reduce your data size quite a bit, especially if you have relatively sparse dimensions.

In our experience with real-world data, we’ve seen segment sizes shrink by 50% on average, but it’s hard to tell without knowing the nature of the data.

Try it out and let us know!

Thanks,

Xavier


Great news! I’ll grab it and give it a shot and let you know.

Thanks again for the quick responses - they are a big reason I’ve been willing to continue advocating Druid for this project.

Ron

It is worth noting that after extensive testing, we settled on lz4 for compression as a great balance of space and speed (you can choose between lz4, lzf, and uncompressed). So you may not get compression as good as gzip’s, but you should get better speed than gzip.
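
If you want to pin the codec explicitly rather than rely on the default, the knob lives in the indexSpec of your tuningConfig. Here is a hedged sketch of that fragment, written as a Python dict that gets serialized into the ingestion spec; the field names are as I remember them from the 0.8-era docs, so double-check them against your version:

```python
import json

# Sketch of the indexSpec fragment of a tuningConfig. The field names
# (bitmap, dimensionCompression, metricCompression) should be verified
# against the docs for the Druid version you are running.
index_spec = {
    "bitmap": {"type": "concise"},     # or "roaring"
    "dimensionCompression": "lz4",     # lz4 | lzf | uncompressed
    "metricCompression": "lz4",        # lz4 | lzf | uncompressed
}

print(json.dumps({"indexSpec": index_spec}, indent=2))
```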

I’m also curious to ask what you are setting for your query granularity and what your rollup ratio is. Having good rollup can dramatically reduce the volume of data you have to store.
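
For reference, queryGranularity lives in the granularitySpec of your dataSchema; here is a minimal sketch of that fragment (again as a Python dict, assuming the 0.8-era spec layout, so verify the field names for your version):

```python
import json

# Sketch of a granularitySpec that truncates event timestamps to one
# minute, which is what lets rows sharing a truncated timestamp and
# identical dimension values roll up. With "NONE", timestamps keep full
# precision and very little rolls up.
granularity_spec = {
    "type": "uniform",
    "segmentGranularity": "DAY",
    "queryGranularity": "MINUTE",
}

print(json.dumps(granularity_spec, indent=2))
```

One common way to measure the rollup ratio is to include a count aggregator at ingestion time; sum(count) divided by the number of stored Druid rows tells you how many raw events collapsed into each row, on average.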

We’re not using rollup at the moment (queryGranularity of “none”), although that is an option I want to explore - we’re storing a type of netflow data, incidentally. Our product already stores netflow in a “rolled up” state; what I’m working on is storing things in a more raw form, which is why I haven’t messed with queryGranularity yet.

But thanks for the pointer, it is something I had forgotten I wanted to test. How well does rollup work with data that has lots of dimensions (30+)?

Ron

Hey Ron,

The efficacy of rollup depends on how often you see events with the same combinations of dimension values. It’s really dependent on your particular dataset and on your choice of dimensions. One example is that if you had a pageviews dataset with ‘url’, ‘browser’, ‘referrer’, ‘country’, and ‘session_id’ dimensions, you would probably not get great rollup, because there aren’t going to be a ton of events from the same session_id on the same url. But if you drop the ‘session_id’ dimension, rollup would probably improve quite a bit overall.

In production, I often see a 40x reduction in the size of raw data when we enable rollup.
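
To make that concrete, here is a toy illustration (not Druid code, just the idea): events that share a truncated timestamp and identical dimension values collapse into one row with aggregated metrics, and the raw-to-stored row ratio is your rollup ratio.

```python
from collections import defaultdict

# Toy illustration of rollup (not Druid code): events with the same
# minute-truncated timestamp and identical dimension values collapse
# into a single row with aggregated metrics. The data here is made up.
raw_events = [
    # (timestamp, url, country, bytes)
    ("2015-10-01T00:00:05", "/home", "US", 100),
    ("2015-10-01T00:00:40", "/home", "US", 250),  # same minute + dims -> rolls up
    ("2015-10-01T00:00:50", "/home", "DE", 80),   # different country -> separate row
]

rolled = defaultdict(lambda: {"count": 0, "bytes": 0})
for ts, url, country, nbytes in raw_events:
    minute = ts[:16]                    # truncate to queryGranularity = minute
    key = (minute, url, country)
    rolled[key]["count"] += 1
    rolled[key]["bytes"] += nbytes

for key, aggs in sorted(rolled.items()):
    print(key, aggs)

print("rollup ratio:", len(raw_events) / len(rolled))  # 3 events -> 2 rows = 1.5x
```

The same logic applies with 30+ dimensions: the more dimensions you keep (and the higher their cardinality), the fewer events share a key, so the ratio shrinks.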