Data schema migrations

Hello,

I’m trying to figure out how to migrate data. Let’s say I want to add a new dimension. segmentGranularity is set to day.

Let’s assume the current time is “2015-10-14T12:10:00Z”.

  1. I’ll stop the current realtime task

  2. Run the batch ingestion task with interval “2014/2015-10-14T12:10:00Z”

  3. Start the realtime task again

After a few minutes I notice that all the events from “2015-10-14T00:00:00Z/2015-10-14T12:10:00Z” are gone. I guess this is because the restarted realtime index task takes priority: with segmentGranularity set to day, its new (mostly empty) day segment overshadows the batch-ingested segment for 2015-10-14.
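For reference, the batch task in step 2 is roughly along these lines. This is only a sketch: the dataSource name, dimensions, and firehose are simplified placeholders, not my real config.

  {
    "type": "index",
    "spec": {
      "dataSchema": {
        "dataSource": "events",
        "parser": {
          "type": "string",
          "parseSpec": {
            "format": "json",
            "timestampSpec": { "column": "timestamp", "format": "auto" },
            "dimensionsSpec": { "dimensions": ["country", "device", "newDimension"] }
          }
        },
        "metricsSpec": [ { "type": "count", "name": "count" } ],
        "granularitySpec": {
          "type": "uniform",
          "segmentGranularity": "DAY",
          "queryGranularity": "NONE",
          "intervals": ["2014/2015-10-14T12:10:00Z"]
        }
      },
      "ioConfig": {
        "type": "index",
        "firehose": { "type": "local", "baseDir": "/data/events", "filter": "*.json" }
      }
    }
  }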

I can see two options for solving this:

  1. Step by step data migration: start sending events with the new dimension, then do the batch ingestion a day later and after that switch the app to use the new dimension.

  2. Reduce the segmentGranularity to an hour. This way I lose at most an hour’s worth of events, and those can be ingested through the batch ingestion task an hour later.

Obviously approach (1) is the best option. However, sometimes it would be better to take a quicker route. Is that possible without setting the segmentGranularity to hour?

Thanks,

Indrek

I thought a bit more about this. Would it make sense for the realtime index task to use hourly segments, with a recurring task that merges completed segments, pretty much converting them into daily segments?

Could I use the “Merge Task” for that (http://druid.io/docs/latest/misc/tasks.html), or should I do batch ingestion instead?
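If I read the tasks doc right, a merge task spec is roughly of this shape. The dataSource, the count aggregator, and the abbreviated segment entries below are placeholders on my part; the real segment descriptors (with loadSpec, shardSpec, size, etc.) would have to be pulled from the coordinator or metadata store.

  {
    "type": "merge",
    "dataSource": "events",
    "aggregations": [ { "type": "count", "name": "count" } ],
    "segments": [
      { "dataSource": "events", "interval": "2015-10-13T00:00:00Z/2015-10-13T01:00:00Z", "version": "2015-10-13T01:05:00.000Z" },
      { "dataSource": "events", "interval": "2015-10-13T01:00:00Z/2015-10-13T02:00:00Z", "version": "2015-10-13T02:05:00.000Z" }
    ]
  }

Having to assemble that segment list every day is what makes me wonder whether batch ingestion would be the simpler route.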

Hi Indrek,

You don’t have to reprocess any of the older intervals to add a new dimension. All queries asking for the new dimension on older intervals will get an empty value for it by default. (You can change the default by using

Are you reprocessing because you really want to fill in old intervals with non-empty values for the new dimension?
In that case, the best thing would be to do batch ingestion of an existing interval only after realtime is done with that interval. To make things faster, you could choose to have segmentGranularity be HOUR.
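For example, the granularitySpec of such a batch task could look roughly like this (the interval is just an illustration covering hours that realtime has already finished):

  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "HOUR",
    "queryGranularity": "NONE",
    "intervals": ["2015-10-14T00:00:00Z/2015-10-14T12:00:00Z"]
  }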

For “merging” the segments generated from realtime (or anything else), you can turn on automatic merging on the coordinator (via the property druid.coordinator.merge.on, see http://druid.io/docs/latest/configuration/coordinator.html), because using the “merge task” manually might be difficult as it needs an explicit list of segments.
Or,
you can use the dataSource path spec to do merging/reindexing (see http://druid.io/docs/latest/ingestion/batch-ingestion.html#dataSource ). However, druid-0.8.1 will need to be patched with https://github.com/druid-io/druid/pull/1797, or you can use druid-0.8.2-rc, which will be available soon.
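Roughly (the dataSource name and interval below are just placeholders), the hadoop index task’s ioConfig would point its inputSpec at the existing dataSource, while the rest of the spec (dataSchema with the coarser segmentGranularity, tuningConfig) stays as in a normal hadoop batch ingestion:

  "ioConfig": {
    "type": "hadoop",
    "inputSpec": {
      "type": "dataSource",
      "ingestionSpec": {
        "dataSource": "events",
        "intervals": ["2015-10-01/2015-10-14"]
      }
    }
  }

It reads the existing segments for those intervals and reindexes them with whatever dataSchema you give, so you can use it to go from HOUR segments to DAY segments.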

– Himanshu

Thank you,

I didn’t know about the druid.coordinator.merge.on option. This solves my problems.

Indrek

Please note that coordinator merging is limited: it cannot handle a segment granularity interval if it contains more than one shard, and all the processing will happen inside peons on the middle manager (or the overlord, depending on whether you are running it in local or remote mode).

– Himanshu