Aggregating older data into coarser granularities

Hi

We are working on a system that feeds data into Druid 0.6.146 via Kafka 0.8 at minute granularity. However, we only need the data at minute granularity for a few hours; after that we would prefer to keep it at hourly (or coarser) granularity. Is there some option to do that?

I understand that the schema options allow for only a single granularity. Once those few hours have passed, we don’t mind if the data for the whole hour is aggregated and stored under a single minute. What I mean is, suppose we have the following:

6:01 AM = 5

6:02 AM = 2

and so on till

7:00 AM = 1

After X hours have passed, we don’t mind if the data is changed to:

6:01 AM = Sum of 6:01 to 7:00 AM.

and all others are deleted.

Hi,

You have two options (plus a third that is on the way):

  1. You can store the events from Kafka in HDFS (in addition to sending them to Druid). After the “few” hours have passed, you can use batch ingestion to reindex that interval at the coarser granularity; the newly indexed data will completely overshadow the older segments (see the sketch after this list).

  2. After the “few” hours, you can use IngestSegmentFirehose (http://druid.io/docs/latest/ingestion/faq.html#how-can-i-reindex-existing-data-in-druid-with-schema-changes) to reindex at the new granularity. Note that all of the re-indexing work happens on the Druid middle manager node, so this is not a great fit if you have a lot of data (there is a sketch of this further below as well).

  3. Not really available yet, but we are also working on providing an IngestSegmentFirehose-style reindexing feature via Hadoop.
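
For (1), the Hadoop batch ingestion config would look roughly like the sketch below. This is only an outline from memory of the 0.6.x batch ingestion docs, so please double-check the exact field names against the docs for your version; the data source name, dimensions, metric, paths, and interval are placeholders. The important parts are "rollupGranularity": "hour", which collapses the minute rows into hourly rows, and the intervals, which must cover the segments you want to overshadow.

    {
      "dataSource": "my_datasource",
      "timestampSpec": { "column": "timestamp", "format": "auto" },
      "dataSpec": { "format": "json", "dimensions": ["dim1", "dim2"] },
      "pathSpec": { "type": "static", "paths": "hdfs://namenode/raw/events/2015-01-01" },
      "granularitySpec": {
        "type": "uniform",
        "gran": "hour",
        "intervals": ["2015-01-01T00:00:00Z/2015-01-01T06:00:00Z"]
      },
      "rollupSpec": {
        "aggs": [ { "type": "longSum", "name": "value", "fieldName": "value" } ],
        "rollupGranularity": "hour"
      },
      "workingPath": "/tmp/druid/working",
      "segmentOutputPath": "hdfs://namenode/druid/segments"
    }

Because the new segments cover the same interval with a newer version, Druid will serve them instead of the old minute-level segments, which can then be dropped.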

I would recommend using (1) for now.
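
If you do end up going with (2) instead, the indexing task would look roughly like this; again, treat the field names as approximate and check the IngestSegmentFirehose docs linked above for the exact schema in your version (the data source, interval, and aggregator below are placeholders):

    {
      "type": "index",
      "dataSource": "my_datasource",
      "granularitySpec": {
        "type": "uniform",
        "gran": "hour",
        "intervals": ["2015-01-01T00:00:00Z/2015-01-01T06:00:00Z"]
      },
      "aggregators": [ { "type": "longSum", "name": "value", "fieldName": "value" } ],
      "indexGranularity": "hour",
      "firehose": {
        "type": "ingestSegment",
        "dataSource": "my_datasource",
        "interval": "2015-01-01T00:00:00Z/2015-01-01T06:00:00Z"
      }
    }

The ingestSegment firehose reads the existing minute-level segments back out of Druid and feeds them through indexing again, and the hourly index granularity rolls them up as they are rewritten.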

– Himanshu