Understanding Segment Merge Task

I see that the segment merge task creates a new segment from a set of possibly overlapping (but non-sharded) segments. I am curious in what case two segments would have overlapping timestamps but still be mergeable. If the records are in different segments, doesn’t that mean their hash or dimension values are different, and thus the records themselves are different (i.e. not mergeable)?

Is this for merging segments from the same interval, but with a different version?

Having looked into it more: if a time interval has enough records, its segment can be sharded by hash or by dimension value.

Given this, it seems that the only way to have non-sharded segments that cover the same time interval is if they are from different versions of the interval. Is this correct?

The merge task was initially created to merge many small segments into a single large segment. You can end up with many small segments if you do not have a lot of data and are creating hourly realtime segments. Given that Druid’s parallelization model is to have one thread scan one segment at a time, we recommend keeping segments between 250 and 800 MB in size. The merge task is one way to ensure that segments in your cluster are always roughly this size.
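To make the sizing idea concrete, here is a rough sketch of how one might group small, time-adjacent segments into merge batches near a target size. This is not Druid's actual merge logic: the `Segment` class, the `plan_merges` function, and the 500 MB target (picked from inside the 250-800 MB range above) are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Hypothetical stand-in for a segment: start hour and size on disk."""
    start_hour: int
    size_mb: float


def plan_merges(segments: List[Segment], target_mb: float = 500.0) -> List[List[Segment]]:
    """Greedily group time-ordered segments so no group exceeds target_mb.

    Illustrative only; assumes every individual segment is smaller than target_mb.
    """
    groups: List[List[Segment]] = []
    current: List[Segment] = []
    current_size = 0.0
    for seg in sorted(segments, key=lambda s: s.start_hour):
        # Close the current group if adding this segment would overshoot the target.
        if current and current_size + seg.size_mb > target_mb:
            groups.append(current)
            current, current_size = [], 0.0
        current.append(seg)
        current_size += seg.size_mb
    if current:
        groups.append(current)
    return groups


# 24 hourly segments of 25 MB each (a low-volume realtime datasource).
hourly = [Segment(start_hour=h, size_mb=25.0) for h in range(24)]
groups = plan_merges(hourly, target_mb=500.0)
group_sizes = [sum(s.size_mb for s in g) for g in groups]
```

With these made-up numbers, the first batch collects 20 hourly segments (500 MB) and the remaining 4 go into a second, smaller batch. Each batch would then be a candidate input for one merge task.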

Gotcha, this makes sense. Thanks

To clarify further:

In the case that you describe (many small segments due to low data volume and hourly realtime segment creation), if the realtime nodes are handing off segments a window period after the segment time interval ends, how do the segments end up overlapping in time? I guess my question is: if a segment is synonymous with a time interval, how can two segments overlap in time?

Hi Michael,

FWIW, MergeTask can be used to merge smaller segments with adjacent intervals. E.g., if you have 24 small segments, one for each hour of the day, you can use MergeTask to create a single segment whose interval covers the complete day. Does that clarify your doubt?

Thanks for the reply, Nishant. This seems like a use case for the append segment task. The difference is that with merge, “Any common timestamps are merged”. In the example you provide, those 24 adjacent segments should not have any common timestamps.

It is not clear to me how two distinct segments would have common timestamps unless the query granularity is being made coarser at the same time. E.g.

24 segments with hourly segment granularity and hourly query granularity ==> 1 segment with daily segment granularity and hourly query granularity (none of the records in this merge should have overlapping timestamps)

24 segments with hourly segment granularity and hourly query granularity ==> 1 segment with daily segment granularity and daily query granularity (records could have overlapping timestamps, given the reduced query granularity)
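The two cases above can be sketched with a simplified model of timestamp truncation. This is not Druid code: the `(timestamp, dimension, count)` row shape, the `truncate` and `merge_rows` helpers, and the use of a plain sum as the aggregator are assumptions made only to illustrate when rows end up with common timestamps.

```python
from collections import defaultdict
from datetime import datetime


def truncate(ts: datetime, granularity: str) -> datetime:
    """Truncate a timestamp to the given (hypothetical) query granularity."""
    if granularity == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if granularity == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported granularity: {granularity}")


def merge_rows(rows, query_granularity: str) -> dict:
    """Combine rows whose truncated timestamp and dimension value match.

    Counts are summed, standing in for whatever aggregator a real merge would use.
    """
    merged = defaultdict(int)
    for ts, dim, count in rows:
        merged[(truncate(ts, query_granularity), dim)] += count
    return dict(merged)


# One row per hourly segment, all with the same dimension value.
rows = [(datetime(2015, 1, 1, hour), "page_a", 1) for hour in range(24)]

hourly = merge_rows(rows, "hour")  # timestamps stay distinct, nothing collapses
daily = merge_rows(rows, "day")    # all 24 rows share one truncated timestamp
```

Under these assumptions, keeping hourly query granularity leaves 24 separate rows, which behaves like an append. Switching to daily query granularity collapses them into a single row with a count of 24, which is the only case here where "common timestamps" actually occur.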