Batch ingestion: expected behavior for intervals that do not align with segment granularity

Hi,

We are currently interested in implementing a batch ingestion flow in order to overwrite old data with new/missing data. We currently use an hourly segment granularity.

If I send a batch job with an interval that does not align with hour boundaries, such as 2017-01-01T10:30 to 2017-01-01T11:30, will only that interval be overwritten? Or will every segment it touches be completely overwritten?

Meaning, will ALL the data in segments 10:00-11:00 and 11:00-12:00 be overwritten with the new data? Or is it safe to say that only the data within the specified interval will be overwritten?

Also, is it better practice to create one big job or to split it into smaller jobs (each covering 1 hour’s worth of data)?

Thanks in advance,

Itamar

If I send a batch job with an interval that does not align with hour boundaries, such as 2017-01-01T10:30 to 2017-01-01T11:30, will only that interval be overwritten? Or will every segment it touches be completely overwritten?

The entire segment(s) that overlap that interval will be rewritten, i.e. you lose all the previous data in those segments and keep only the new data.
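To make the example concrete, here is a minimal Python sketch that lists the hourly segments a given interval overlaps. The helper name `hourly_segments` is hypothetical, and the sketch assumes the interval is half-open (start inclusive, end exclusive), which is an assumption and not something confirmed in this thread.

```python
from datetime import datetime, timedelta

def hourly_segments(start, end):
    """List the hourly buckets that overlap the interval [start, end).

    Assumption: the end of the interval is treated as exclusive.
    """
    # Floor the start to the beginning of its hour.
    bucket = start.replace(minute=0, second=0, microsecond=0)
    buckets = []
    while bucket < end:
        buckets.append((bucket, bucket + timedelta(hours=1)))
        bucket += timedelta(hours=1)
    return buckets

segments = hourly_segments(datetime(2017, 1, 1, 10, 30),
                           datetime(2017, 1, 1, 11, 30))
for seg_start, seg_end in segments:
    print(seg_start.isoformat(), "/", seg_end.isoformat())
```

For the 10:30 to 11:30 interval this yields the two buckets 10:00-11:00 and 11:00-12:00, i.e. the two segments that would be rewritten in full.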

Meaning, will ALL the data in segments 10:00-11:00 and 11:00-12:00 be overwritten with the new data? Or is it safe to say that only the data within the specified interval will be overwritten?

Yes, all of it. If you want to keep the existing data in those segments, you need to use delta ingestion, where you specify both the new data and the old Druid segments as input:

http://druid.io/docs/latest/ingestion/update-existing-data.html
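As a rough illustration of what such a delta ingestion input might look like, the Python sketch below builds an `ioConfig` that combines existing segments with new files. The field names (`multi`, `children`, `dataSource`, `ingestionSpec`, `static`, `paths`) are my understanding of the Hadoop batch input spec described on the page above and should be checked against your Druid version. The helper `delta_io_config`, the data source name `events`, and the path are hypothetical.

```python
import json

def delta_io_config(data_source, interval, new_data_path):
    """Build a Hadoop-style ioConfig that reads old segments plus new files.

    Assumption: field names follow the "multi" inputSpec from the Druid
    docs on updating existing data; verify them for your version.
    """
    return {
        "type": "hadoop",
        "inputSpec": {
            "type": "multi",
            "children": [
                # Existing Druid segments for the interval being rewritten.
                {
                    "type": "dataSource",
                    "ingestionSpec": {
                        "dataSource": data_source,
                        "intervals": [interval],
                    },
                },
                # New or missing raw data to merge in.
                {"type": "static", "paths": new_data_path},
            ],
        },
    }

io_config = delta_io_config(
    "events",  # hypothetical data source name
    "2017-01-01T10:00:00.000/2017-01-01T12:00:00.000",
    "/path/to/new/data",  # hypothetical path
)
print(json.dumps(io_config, indent=2))
```

The interval here is widened to whole hours so that it covers both affected segments in full.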

Also, is it better practice to create one big job or to split it into smaller jobs (each covering 1 hour’s worth of data)?

It depends on what you call a big job; that is subjective. I would think about the requirements of your pipeline and index the data based on those.

What matters for Druid is that all the data belonging to the same segment granularity bucket is indexed at the same time. That way you minimize IO by avoiding the delta ingestion phase.

Thanks Slim. I’m interested in overwriting the segments. As a side question, is the interval specification inclusive or exclusive? Should I specify the hour as [00:00:00.000 to 01:00:00.000] or [00:00:00.000 to 00:59:59.999] when overwriting the hourly segment?