Reducing the data size.

Hi guys,

We have a data schema with two high-cardinality dimensions. We are recording user actions, so one is a userId and the other is a sessionId. Since we have hundreds of millions of users, it is highly unlikely that the same userId occurs more than once within our 1-hour time window.

Because of these high-cardinality dimensions, our hourly segments (with HOUR query granularity, which is the minimum we can go), each containing 3.6M data points (~30 dimensions and 20 metrics), are 745MB at an ingestion rate of only 1k events/second. This means about 17.9GB per day, or 125GB a week. If we increase the ingestion rate to 5k events/second, the data size goes up to 3.7GB per hour, or 89.4GB per day.
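As a rough check of the figures above, here is a minimal Python sketch of the arithmetic. It assumes segment size scales linearly with the ingestion rate and uses decimal units (1GB = 1000MB); the function names are illustrative only and are not part of Druid.

```python
SECONDS_PER_HOUR = 3600
MB_PER_GB = 1000  # assumption: decimal units, as the figures in the post suggest


def hourly_events(rate_per_sec):
    """Number of events ingested in one hour at a constant rate."""
    return rate_per_sec * SECONDS_PER_HOUR


def bytes_per_event(segment_mb, events):
    """Average on-disk bytes per ingested event in one segment."""
    return segment_mb * 1_000_000 / events


def projected_sizes(segment_mb, base_rate, new_rate):
    """Scale a measured hourly segment size to another ingestion rate.

    Assumption: size grows linearly with the rate (no rollup benefit),
    which matches the near-unique userId/sessionId case described above.
    Returns (hourly MB, daily GB, weekly GB).
    """
    hourly_mb = segment_mb * (new_rate / base_rate)
    daily_gb = hourly_mb * 24 / MB_PER_GB
    weekly_gb = daily_gb * 7
    return hourly_mb, daily_gb, weekly_gb


if __name__ == "__main__":
    events = hourly_events(1000)
    print(f"events per hour: {events}")
    print(f"bytes per event: {bytes_per_event(745, events):.0f}")
    for rate in (1000, 5000):
        hourly_mb, daily_gb, weekly_gb = projected_sizes(745, 1000, rate)
        print(f"{rate}/s: {hourly_mb:.0f}MB/hour, "
              f"{daily_gb:.2f}GB/day, {weekly_gb:.1f}GB/week")
```

By this estimate each event costs roughly 200 bytes on disk, which is the number to watch when trying out schema changes.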

We tried using Roaring compression for the bitmaps, but it didn't seem to help. In fact, it increased the size by 5MB (out of 62.1MB, which is the space taken up by 300k events).

Is there a way to reduce the storage footprint? I am asking because, with such a large footprint, we will have trouble storing all the data in the memory of the historicals, and we are trying to find a way around that, so your input would be much appreciated!

P.S.

I am attaching part of the output of the indexing task that does the re-indexing with segment granularity HOUR and query granularity HOUR.

2016-08-17T10:43:03,934 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Starting dimension[en] with cardinality[16]

2016-08-17T10:43:03,980 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Completed dimension[en] in 47 millis.
2016-08-17T10:43:03,980 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Starting dimension[uid] with cardinality[501]
2016-08-17T10:43:04,022 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Completed dimension[uid] in 42 millis.
2016-08-17T10:43:04,023 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Starting dimension[cmid] with cardinality[501]
2016-08-17T10:43:04,043 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Completed dimension[cmid] in 21 millis.
2016-08-17T10:43:04,043 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Starting dimension[ddv] with cardinality[51]
2016-08-17T10:43:04,101 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Completed dimension[ddv] in 58 millis.
2016-08-17T10:43:04,102 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Starting dimension[ddn] with cardinality[51]
2016-08-17T10:43:04,109 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Completed dimension[ddn] in 7 millis.
**2016-08-17T10:43:04,109 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Starting dimension[sid] with cardinality[500,000]** ***(This can be as much as the actual ingested events in the segment)***
2016-08-17T10:43:04,246 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Completed dimension[sid] in 137 millis.
2016-08-17T10:43:04,246 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Starting dimension[ddc] with cardinality[51]
2016-08-17T10:43:04,250 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Completed dimension[ddc] in 4 millis.
2016-08-17T10:43:04,250 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Starting dimension[ddf] with cardinality[51]
2016-08-17T10:43:04,254 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Completed dimension[ddf] in 4 millis.
2016-08-17T10:43:04,254 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Starting dimension[dde] with cardinality[51]
2016-08-17T10:43:04,258 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Completed dimension[dde] in 4 millis.
2016-08-17T10:43:04,258 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Starting dimension[dn] with cardinality[1]
2016-08-17T10:43:04,260 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Completed dimension[dn] in 2 millis.
2016-08-17T10:43:04,261 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Starting dimension[lcr] with cardinality[251]
2016-08-17T10:43:04,275 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Completed dimension[lcr] in 14 millis.
2016-08-17T10:43:04,275 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Starting dimension[lcc] with cardinality[251]
2016-08-17T10:43:04,288 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Completed dimension[lcc] in 13 millis.
2016-08-17T10:43:04,288 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Starting dimension[lca] with cardinality[251]
2016-08-17T10:43:04,302 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Completed dimension[lca] in 14 millis.
2016-08-17T10:43:04,302 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Starting dimension[lcl] with cardinality[21]
2016-08-17T10:43:04,309 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Completed dimension[lcl] in 7 millis.
2016-08-17T10:43:04,309 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Starting dimension[don] with cardinality[1]
2016-08-17T10:43:04,311 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Completed dimension[don] in 2 millis.
2016-08-17T10:43:04,312 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Starting dimension[lccc] with cardinality[101]
2016-08-17T10:43:04,316 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Completed dimension[lccc] in 4 millis.
**2016-08-17T10:43:04,316 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Starting dimension[cid] with cardinality[500,000]** ***(This can be as much as the actual ingested events in the segment)***
2016-08-17T10:43:04,329 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Completed dimension[cid] in 13 millis.
2016-08-17T10:43:04,330 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Starting dimension[did] with cardinality[1]
2016-08-17T10:43:04,332 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Completed dimension[did] in 3 millis.
2016-08-17T10:43:04,332 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Starting dimension[doc] with cardinality[1]
2016-08-17T10:43:04,334 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Completed dimension[doc] in 2 millis.
2016-08-17T10:43:04,334 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Starting dimension[dof] with cardinality[51]
2016-08-17T10:43:04,338 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Completed dimension[dof] in 4 millis.
2016-08-17T10:43:04,338 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Starting dimension[lccn] with cardinality[6]
2016-08-17T10:43:04,340 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Completed dimension[lccn] in 2 millis.
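To see at a glance which dimensions dominate, the log excerpt above can be summarised with a short script. This is only a sketch: the regular expression assumes the `Starting dimension[...] with cardinality[...]` wording exactly as it appears in this excerpt (including optional thousands separators such as `500,000`), and the helper names and threshold are hypothetical.

```python
import re

# Assumption: matches the IndexMerger line format shown in the excerpt above.
LINE_RE = re.compile(
    r"Starting dimension\[(?P<name>[^\]]+)\] "
    r"with cardinality\[(?P<card>[\d,]+)\]"
)


def dimension_cardinalities(log_text):
    """Map each dimension name to the cardinality reported in the log."""
    result = {}
    for match in LINE_RE.finditer(log_text):
        # Strip thousands separators before converting to int.
        result[match.group("name")] = int(match.group("card").replace(",", ""))
    return result


def high_cardinality(cards, threshold):
    """Dimensions at or above the threshold, highest cardinality first."""
    picked = [(name, card) for name, card in cards.items() if card >= threshold]
    return sorted(picked, key=lambda item: item[1], reverse=True)


SAMPLE = """\
2016-08-17T10:43:03,934 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Starting dimension[en] with cardinality[16]
2016-08-17T10:43:04,109 INFO [task-runner-0-priority-0] io.druid.segment.IndexMerger - Starting dimension[sid] with cardinality[500,000]
"""

if __name__ == "__main__":
    cards = dimension_cardinalities(SAMPLE)
    for name, card in high_cardinality(cards, threshold=100_000):
        print(f"{name}: {card}")
```

Run against the full excerpt, only `sid` and `cid` would pass a 100,000 threshold; every other dimension listed has a cardinality of 501 or less.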

I would also like to know whether someone can shed some light on this. Thanks.

It will take me a long time to write up my thoughts here, but many of them are reflected in this blog post: