Understanding the "persist" functionality of Kafka indexing

In the documentation for Kafka indexing, several knobs such as “maxRowsInMemory” refer to “persisting”.

Is this “persistence” merely about bounding the memory usage within a single indexing task? Or is the persisted data also used to recover from crashed indexing tasks without having to start over from scratch?


Hey Dave,

I think “maxRowsInMemory” controls how many rows are held in memory before Druid persists them to locally generated files, avoiding OOM - those files are copied to deep storage (S3) only after the task completes successfully. So I’d say YES, it’s about bounding memory usage on the MiddleManager nodes, and NO, it’s not used for recovery (a failed task will start over from scratch).
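For reference, here is a minimal sketch of where that knob lives in a Kafka supervisor spec’s tuningConfig (the values below are illustrative, not recommendations; “intermediatePersistPeriod” is the companion time-based trigger for the same persist behavior):

```json
{
  "tuningConfig": {
    "type": "kafka",
    "maxRowsInMemory": 150000,
    "intermediatePersistPeriod": "PT10M"
  }
}
```

Whichever threshold is hit first (row count or period) triggers an intermediate persist of the in-memory rows to local disk.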


Hmm, actually I’ve been experimenting, and it seems the ingester is able to recover from the persisted files, as long as they are still there!

This is interesting - how did you manage to do this?
In my case, checking the starting offsets in the logs after a job failure, the new job keeps the same startPartitions, as in this excerpt:

  "ioConfig" : {
    "type" : "kafka",
    "taskGroupId" : 4,
    "baseSequenceName" : "index_kafka_[topic]_84e9aa36ea0df72",
    "startPartitions" : {
      "topic" : "[topic]",
      "partitionOffsetMap" : {
        "4" : 8841723833,
        "10" : 8841588818