Zero byte segments & realtime ingestion questions

Hi,

I’m using the Kafka indexing service extension.

One of our datasources stopped receiving new data due to a Kafka issue and had no running tasks.

Now, after resolving it, I see that the total size of the newly created segments is “0 B” (zero bytes).

I also don’t see these segments in druid_segments in the metadata store.

Is it because all these segments are still being queried from the stream ingestion tasks rather than from the historicals?

Second, it’s been a few hours already and segmentGranularity=HOUR, so I’m getting more 0-byte segments each hour.

How come none of the previous-hour segments has been committed? Could it be because the thresholds (maxRowsPerSegment, intermediateHandoffPeriod) haven’t been reached yet?

Third, what happens if a task fails (e.g. if the MiddleManager goes down)?

My guess is that the data will be lost, since it’s already been pulled from Kafka but hasn’t been committed to deep storage yet. Is that a correct assumption?

Thanks!

Eyal.

Hi Eyal,

  1. You’re correct. Segments showing 0 bytes are still being served by Kafka tasks and haven’t been published yet. Once they are published, the segment size should be updated properly.
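If you want to verify this, one way (a sketch, assuming Druid SQL is enabled on your cluster and your Druid version includes the sys.segments system table) is to check which segments are still realtime-only; `my_datasource` is a placeholder for your datasource name:

```sql
-- Segments still served by realtime (Kafka indexing) tasks:
-- is_realtime = 1, is_published = 0, and size stays 0 until publishing.
SELECT segment_id, "start", "end", size, is_published, is_realtime
FROM sys.segments
WHERE datasource = 'my_datasource'
  AND is_realtime = 1
  AND is_published = 0
ORDER BY "start" DESC;
```

Once a segment is published and handed off to Historicals, it should appear with is_published = 1 and a non-zero size, and show up in druid_segments in the metadata store.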

  2. I think your tasks haven’t met any of the publishing conditions yet: maxRowsPerSegment, intermediateHandoffPeriod, or taskDuration. Each Kafka task reads data and generates segments for the length of taskDuration. During taskDuration, a task can intermittently publish segments if either maxRowsPerSegment or intermediateHandoffPeriod is met. After taskDuration elapses, Kafka tasks enter ‘publishing’ mode and publish any segments that haven’t been published yet.
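For reference, those thresholds live in the Kafka supervisor spec. A minimal sketch (values are illustrative, not recommendations; dataSchema and other required fields are omitted for brevity, and `my-topic` / the broker address are placeholders):

```json
{
  "type": "kafka",
  "ioConfig": {
    "topic": "my-topic",
    "consumerProperties": { "bootstrap.servers": "kafka-broker:9092" },
    "taskDuration": "PT1H"
  },
  "tuningConfig": {
    "type": "kafka",
    "maxRowsPerSegment": 5000000,
    "intermediateHandoffPeriod": "PT30M"
  }
}
```

With settings like these, a task would publish mid-run every 30 minutes (or sooner if a segment reaches 5M rows), and publish everything remaining when the hour-long taskDuration ends.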

  3. If one of the tasks fails, any data that hasn’t been ‘published’ is lost in Druid. However, the supervisor will spawn a new task that reads from the offset at which the failed task last published.

So, as long as your data is still in Kafka, Druid will read all of it without loss.

Jihoon