Kafka indexing task - where is data stored while in reading state

While a Kafka indexing task is running and in the reading state, where is the data that has been sent to it for indexing stored? This question comes up because I am wondering where to allocate resources, and how much, if a long taskDuration is used.

Hey Jason,

Up to “maxRowsInMemory” rows are held in heap (although in 0.9.1 this limit was not applied correctly; this is fixed for 0.9.2). Beyond that, data is stored on the local disk of the middleManager running the indexing task. At the end of the taskDuration, all pending data is pushed to deep storage and handed off to historical nodes, and the task exits.
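
For reference, both of those knobs live in the Kafka supervisor spec: maxRowsInMemory in the tuningConfig and taskDuration in the ioConfig. A minimal sketch, with the topic name and values purely illustrative rather than recommendations:

    "tuningConfig": {
      "type": "kafka",
      "maxRowsInMemory": 75000
    },
    "ioConfig": {
      "topic": "metrics",
      "taskDuration": "PT1H"
    }

Roughly: rows up to maxRowsInMemory sit on the task's heap, anything persisted beyond that sits on the middleManager's local disk, and it all stays there until the task pushes its segments at the end of the taskDuration.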

Hey Linbo,

  1. Ah, this will depend on https://github.com/druid-io/druid/issues/3241. For a workaround you could modify the source to remove the NoneShardSpec branch in DetermineHashedPartitionsJob.java.

  2. Definitely, things work best if segmentGranularity is chosen based on your actual data volume. For most people, HOUR or DAY works best. Another thing to keep in mind with Kafka indexing is that you get at least one segment for every Kafka partition, so if you have too many Kafka partitions, then you can get a lot of small segments.

  3. Your understanding is right. Data is pushed to deep storage (and handed off to historicals) at the end of each task, so taskDuration has a direct impact on how much data is kept in the realtime system. Usually a duration similar to, or larger than, your segmentGranularity works best (see the sketch after this list).

  4. In this case, Druid should use the fifteen_minute segments for time ranges where they already exist, and create larger granularity segments for new ranges. If this doesn’t seem to be working right then please let us know.
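
To make 2 and 3 concrete, here is a sketch of how those two settings might line up in the supervisor spec for an hourly setup (the values are illustrative only, not a recommendation):

    "dataSchema": {
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "NONE"
      }
    },
    "ioConfig": {
      "taskDuration": "PT1H"
    }

With segmentGranularity HOUR and taskDuration PT1H, each task reads for about one segment interval before pushing to deep storage and handing off, so the realtime system holds roughly an hour's worth of data, and at least one segment is produced per Kafka partition per interval.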