Kafka-indexer and out-of-order data?

Hi all, I'm trying to understand the segment/kafka-indexer flow.

Let's say I have a Kafka topic with events like:

event1, time=2019-01-01 00:00:00

event2, time=2019-01-01 00:14:00

event3, time=2019-01-01 00:15:00

event4, time=2019-01-01 00:14:20

We are ingesting with 15 minute granularity…

I assume the following would happen:

Events 1-2 will be placed into a segment, something like:

SEGMENT 1: test_2019-01-01T00:00:00.000Z_2019-01-01T00:15:00.000Z_2019-05-08T11:58:12.052Z

Now when we get event3, Druid will determine it needs to go into the next segment:

something like:

SEGMENT 2: test_2019-01-01T00:15:00.000Z_2019-01-01T00:30:00.000Z_2019-05-08T11:58:12.052Z

It will likely commit the first segment (immutable at this point).

So when it gets event4, what will happen?

Will it put it into segment 2 and somehow keep an index indicating that segment 2 has ‘data’ overlapping?

Or does it drop the event?

Thanks

Dan

Hi Daniel,

The Kafka indexing service provides exactly-once ingestion, so data would never be missing or duplicated.

In the Kafka indexing service, whenever a new event arrives that cannot be added to any segment that is still actively being generated, a new segment is created for that event.

In your example, a new segment 3 would be created for event4 which would have the segmentId of test_2019-01-01T00:00:00.000Z_2019-01-01T00:15:00.000Z_2019-05-08T11:58:12.052Z_1 (1 is added at the end of the id). The new segment would have the same interval and version, but be distinguished with the unique suffix (we call it partitionId).
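As a rough illustration of the flow described above, here is a toy Python model. It is only a sketch, not Druid's actual allocation code: the function names (`bucket_start`, `segment_id`, `allocate`) and the `publish_after` parameter are made up for this example, and it assumes, as in the question, that the open segments are published right after event3 arrives.

```python
from datetime import datetime, timedelta

GRANULARITY = timedelta(minutes=15)
VERSION = "2019-05-08T11:58:12.052Z"  # version string copied from the thread


def fmt(ts):
    # ISO-8601 with milliseconds and a trailing Z, as in the segment IDs above
    return ts.strftime("%Y-%m-%dT%H:%M:%S.000Z")


def bucket_start(ts):
    # Floor the timestamp to the start of its 15-minute interval
    epoch = datetime(1970, 1, 1)
    return epoch + ((ts - epoch) // GRANULARITY) * GRANULARITY


def segment_id(datasource, start, partition):
    base = f"{datasource}_{fmt(start)}_{fmt(start + GRANULARITY)}_{VERSION}"
    # Partition 0 carries no suffix; later partitions get _1, _2, ...
    return base if partition == 0 else f"{base}_{partition}"


def allocate(events, datasource="test", publish_after=()):
    # Hypothetical simplification: after any event named in publish_after,
    # every open segment is published and accepts no more rows.
    open_segments = {}   # interval start -> segment ID still accepting rows
    next_partition = {}  # interval start -> next unused partition number
    assignments = []
    for name, ts in events:
        start = bucket_start(ts)
        if start not in open_segments:
            partition = next_partition.get(start, 0)
            next_partition[start] = partition + 1
            open_segments[start] = segment_id(datasource, start, partition)
        assignments.append((name, open_segments[start]))
        if name in publish_after:
            open_segments.clear()
    return assignments


events = [
    ("event1", datetime(2019, 1, 1, 0, 0, 0)),
    ("event2", datetime(2019, 1, 1, 0, 14, 0)),
    ("event3", datetime(2019, 1, 1, 0, 15, 0)),
    ("event4", datetime(2019, 1, 1, 0, 14, 20)),
]

for name, seg in allocate(events, publish_after={"event3"}):
    print(name, "->", seg)
```

In this model event4 lands in a new segment for the 00:00-00:15 interval whose ID ends in `_1`. If nothing has been published yet (`publish_after` left empty), event4 simply joins the same open segment as events 1-2.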

Jihoon

Ahh, that makes perfect sense, thanks for the explanation.

In that scenario, we could then have 1 segment with just 1 event.

Now with compaction rules running, that would eventually be merged with other segments. Without compaction rules, you could potentially end up with lots of little segments over time, hurting query performance.

Yes, exactly!
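
To make the small-segment concern concrete, here is a toy sketch of what merging per interval does to the segment count. This is not Druid's compaction algorithm: the `compact` function and the interval labels and row counts in `segments` are hypothetical, chosen to mirror the four-event example from the question.

```python
from collections import defaultdict

# Hypothetical (interval, row_count) pairs mirroring the example:
# events 1-2 in one segment, late event4 in its own, event3 in the next interval.
segments = [
    ("2019-01-01T00:00/00:15", 2),
    ("2019-01-01T00:00/00:15", 1),
    ("2019-01-01T00:15/00:30", 1),
]


def compact(segments):
    # Toy model: merge every segment that shares an interval into one,
    # keeping the total row count unchanged.
    rows = defaultdict(int)
    for interval, row_count in segments:
        rows[interval] += row_count
    return sorted(rows.items())


merged = compact(segments)
print(len(segments), "segments before,", len(merged), "after")
for interval, row_count in merged:
    print(interval, row_count)
```

The row totals stay the same; only the number of segments a query has to touch per interval goes down.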