Druid schema storage

Hi everyone,

I am wondering whether Druid optimizes or compresses column values when storing data. For example, suppose a schema looks like:

field1 | field2 | field3

Druid rolls up data at the query granularity you request. As long as events fall within the same query granularity bucket (MINUTE, for example), rows that share the same combination of dimension values will have their metrics rolled up together. So I’m assuming that field2 and field3 are not dimensions you want to index, but rather metrics you want to aggregate.
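As a rough illustration of the rollup described above, here is a minimal Python sketch. It is not Druid code: the `rollup` and `truncate_to_minute` helpers, the sample events, and the choice of sum aggregators for field2 and field3 are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime


def truncate_to_minute(ts):
    """Bucket a timestamp at MINUTE granularity (hypothetical helper)."""
    return ts.replace(second=0, microsecond=0)


def rollup(events):
    """Sum field2 and field3 per (minute bucket, field1) combination.

    Assumes both metrics use a sum aggregator; other aggregators
    would combine values differently.
    """
    totals = defaultdict(lambda: [0, 0])
    for ts, field1, field2, field3 in events:
        key = (truncate_to_minute(ts), field1)
        totals[key][0] += field2
        totals[key][1] += field3
    return {key: tuple(sums) for key, sums in totals.items()}


# Made-up sample events: (timestamp, field1, field2, field3)
events = [
    (datetime(2015, 1, 1, 12, 0, 5), "SourceA", 1, 10),
    (datetime(2015, 1, 1, 12, 0, 40), "SourceA", 2, 20),
    (datetime(2015, 1, 1, 12, 0, 50), "SourceB", 3, 30),
    (datetime(2015, 1, 1, 12, 1, 10), "SourceA", 4, 40),
]

rolled = rollup(events)
for (bucket, field1), (sum2, sum3) in sorted(rolled.items()):
    print(bucket.isoformat(), field1, sum2, sum3)
```

The four input events collapse into three stored rows, because the two SourceA events in the 12:00 minute share a bucket and a dimension value.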

Hi Charles,

Thanks for the response.

So I’m assuming that field2 and field3 are not dimensions you want to index, but rather metrics you want to aggregate.

Yes, field1 is a dimension, while field2 and field3 are aggregators.

I was curious how people store repeated data. In my case, to save space, I could store field1 as A and B instead of SourceA and SourceB respectively. But I am not sure whether that would have any impact on how the values are stored by Druid.

Hi Prajwal,

You may be interested in reading:

http://druid.io/blog/2011/04/30/introducing-druid.html

which talks about how Druid rolls up data.

You may also want to read the white paper:

http://static.druid.io/docs/druid.pdf

which talks about how Druid compresses columns.

With respect to storing “SourceA” as “A” and “SourceB” as “B” –

I am assuming you are worried that if “SourceA” appears in multiple rows of the segment, a lot of storage is wasted storing “SourceA” multiple times, and that you could save some space by storing “A” in all those places instead.

Generally, that is not something to worry about. Druid keeps a dictionary encoding, e.g. “SourceA” -> 1, “SourceB” -> 2, etc. Row data is stored as something like an array of ints; for example, [“SourceA”, “SourceB”, “SourceA”] would actually look like [1,2,1] on disk.

In the special case where the cardinality of field1 is very high, with little repetition in the row data, your optimization may help a little.
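To make the dictionary idea concrete, here is a small Python sketch. It only mirrors the example above and is not Druid's actual on-disk format: the function names, the 1-based ids, and the 4-bytes-per-id estimate are assumptions for illustration, and any further compression is ignored.

```python
def dictionary_encode(values):
    """Map each distinct string to a small int id and encode the column.

    Ids start at 1 only to match the example in this thread.
    """
    dictionary = {}
    encoded = []
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary) + 1
        encoded.append(dictionary[value])
    return dictionary, encoded


def approx_bytes(values, id_bytes=4):
    """Rough size estimate: each distinct string stored once, plus one id per row.

    id_bytes=4 is an assumed fixed width, not a Druid constant.
    """
    dictionary, encoded = dictionary_encode(values)
    dictionary_size = sum(len(v.encode("utf-8")) for v in dictionary)
    return dictionary_size + id_bytes * len(encoded)


dictionary, encoded = dictionary_encode(["SourceA", "SourceB", "SourceA"])
print(dictionary, encoded)

# Low cardinality: 2 distinct values repeated over 2000 rows.
low_long = ["SourceA", "SourceB"] * 1000
low_short = ["A", "B"] * 1000
low_saving = approx_bytes(low_long) - approx_bytes(low_short)

# High cardinality: every one of 1000 rows has its own value.
high_long = ["Source%d" % i for i in range(1000)]
high_short = [str(i) for i in range(1000)]
high_saving = approx_bytes(high_long) - approx_bytes(high_short)

print(low_saving, high_saving)
```

Under these assumptions, shortening the names saves only the few bytes in the dictionary when values repeat a lot, while the saving grows with the number of distinct values when almost every row is unique.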

– Himanshu

Thanks Fangjin. Re-reading the white paper and blog posts is clearing things up.

I am assuming you are worried that if “SourceA” appears in multiple rows on the segment then a lot of storage is wasted in storing “SourceA” multiple times and you could save some space by storing “A” in all those places instead.

Himanshu, yeah, I was concerned about that, but since Druid keeps them as dictionary keys, I’m not too worried now. Thanks for the explanation.