Kafka Indexing Service & Late Data

The docs for the Kafka Indexing Service (http://druid.io/docs/latest/development/extensions-core/kafka-ingestion.html) state:

"... able to read non-recent events from Kafka and are not subject to the window period considerations imposed on other ingestion mechanisms"

How does this work with segments being immutable?

For example, if a segment covering the 14:00-15:00 hour is created and handed off, and a task is then created for 15:00-16:00, what happens to data that arrives later with a timestamp between 14:00 and 15:00?


The partition set for hour 14-15 is expanded with an extra segment.
The data point is added to that segment, which is eventually handed
off. This can produce a trail of small segments if it happens
regularly, but reindexing on a regular cadence, either from the raw
data or from the segments themselves, can be used to fix that.
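
To make the behaviour described above concrete, here is a minimal toy model in Python. It is only a sketch of the idea, not Druid's actual allocation code: the Segment and Timeline classes, their method names, and the dates are all hypothetical and chosen for illustration. Handed-off segments are never modified; a late event instead gets a new segment with the next partition number in the same hour.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Segment:
    """Hypothetical stand-in for a segment: one shard of one hourly interval."""
    interval_start: datetime
    partition_num: int
    rows: list = field(default_factory=list)
    handed_off: bool = False  # once True, the segment is treated as immutable


class Timeline:
    """Toy model: maps each hour to its partition set (a list of segments)."""

    def __init__(self):
        self.partition_sets = defaultdict(list)

    def _open_segment(self, hour):
        # Reuse a segment that is still being written, if there is one.
        for seg in self.partition_sets[hour]:
            if not seg.handed_off:
                return seg
        # Otherwise expand the partition set with the next partition number.
        seg = Segment(hour, len(self.partition_sets[hour]))
        self.partition_sets[hour].append(seg)
        return seg

    def ingest(self, ts, row):
        hour = ts.replace(minute=0, second=0, microsecond=0)
        seg = self._open_segment(hour)
        seg.rows.append(row)
        return seg

    def hand_off(self, hour):
        for seg in self.partition_sets[hour]:
            seg.handed_off = True


timeline = Timeline()
h14 = datetime(2017, 1, 1, 14)  # arbitrary example date

timeline.ingest(h14 + timedelta(minutes=5), "a")
timeline.ingest(h14 + timedelta(minutes=50), "b")
timeline.hand_off(h14)

# A late event for 14:00-15:00 arrives after hand-off.
late = timeline.ingest(h14 + timedelta(minutes=30), "late")
print(late.partition_num, len(timeline.partition_sets[h14]))  # prints: 1 2
```

In this model the original segment (partition 0) keeps its two rows untouched, and the late event lands alone in partition 1, which is the "small segment" case that periodic reindexing is meant to clean up.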


Thanks Eric.

I’m not familiar with the term "partition set" as it relates to Druid, nor did I see any mention of it in the documentation. Could you please explain what it is?

If a single event arrives for the 14-15 window, will a new segment be created just for that event (assuming no more arrive)?

Essentially, yes, a new segment would be created just for that one event. That’s what I meant by “a trail of small segments”.

When you have multiple tasks ingesting data in parallel for the same time period (that are not replicas), the collection of segments produced by those tasks is what I was referring to as the partition set. We also call each of them a "shard", so "shard set" would also be a reasonable term.
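
As a rough illustration of the term, the sketch below groups a list of segment descriptors by interval so that each group is one partition set (shard set). The tuple layout, the interval strings, and the row counts are made up for the example and are not Druid's metadata format.

```python
from collections import defaultdict

# (interval, partition number, row count): illustrative values only
segments = [
    ("14:00/15:00", 0, 500_000),
    ("14:00/15:00", 1, 480_000),
    ("14:00/15:00", 2, 1),  # the single late event from the question
    ("15:00/16:00", 0, 510_000),
    ("15:00/16:00", 1, 495_000),
]


def shard_sets(segments):
    """Group segments by interval; each group is one partition (shard) set."""
    sets = defaultdict(list)
    for interval, partition_num, rows in segments:
        sets[interval].append((partition_num, rows))
    return {interval: sorted(shards) for interval, shards in sets.items()}


def rows_for(interval, segments):
    """Data for an interval is the union of all shards in its set."""
    return sum(rows for _, rows in shard_sets(segments).get(interval, []))


sets = shard_sets(segments)
print(len(sets["14:00/15:00"]))           # prints: 3
print(rows_for("14:00/15:00", segments))  # prints: 980001
```

The point of the grouping is that a time period is covered by every shard in its set taken together, so adding a one-row shard changes the set but leaves the existing shards as they were.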


Thanks Eric. I didn’t realize that there could be multiple segments covering the same time period, but I guess that makes sense, given that multiple Kafka partitions will certainly contain data from the same time period and each partition will have its own segments.