Troubleshooting Kafka indexer (unavailable segments)

Hi all, I'm slowly getting to know/understand Druid and Kafka a bit more.

Anyway, in my last test I loaded approximately 2.5B records from a Kafka cluster (representing one month of data).

I ‘killed’ one of my Kafka nodes to avoid disk space issues (Kafka is running with replication factor 2).

I then discovered that one of my Kafka broker nodes had a misconfigured DNS entry, in addition to the node I had taken down.

So now, after all is said and done, we have about 8,000 segments online and 300 unavailable. Looking at the segment log file, it shows the task cannot reach the Kafka nodes in question: the one with the bad DNS, and the one taken offline to prevent the disk issue.

I also see errors/warnings in the task log about reaching those Kafka nodes (pointing to the DNS issue).

So I re-posted my indexing task with a corrected bootstrap list. Should this, in theory, ‘fix’ the segments that were stuck as unavailable, since the task should now be able to stream from the previously unreachable brokers?

I suspect it will not ‘magically’ go back and re-stream the data behind those broken segments (even though, in theory, each segment should correspond to a Kafka partition/offset range).

So I suppose the trick to fix this (recover the bad segments) is to:

  1. Identify the ‘time windows’ covered by all 300 unavailable segments (see the first sketch after this list)

  2. Ideally, if possible, identify which Kafka partitions/offsets the bad segments referenced, then stream only those ranges to a file and re-ingest them via an index task (see the second sketch after this list)

  3. Otherwise, download the data covering the time windows in question to a file, then re-index each entire time window (for each of the 300 segments)
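For step 1, if your Druid version exposes the sys.segments metadata table over Druid SQL, something like the following could list the intervals of the published-but-unavailable segments. This is only a minimal sketch: the broker address and datasource name are placeholders, and older Druid versions would need the coordinator console or the metadata store instead.

```python
# Minimal sketch: list intervals of published-but-unavailable segments via
# Druid SQL (requires a Druid version with the sys.segments table).
# Broker address and datasource name are placeholders.
import requests

BROKER = "http://localhost:8082"   # assumed broker with SQL enabled
DATASOURCE = "my_datasource"       # assumed datasource name

sql = f"""
SELECT segment_id, "start", "end"
FROM sys.segments
WHERE datasource = '{DATASOURCE}'
  AND is_published = 1
  AND is_available = 0
"""

resp = requests.post(f"{BROKER}/druid/v2/sql", json={"query": sql})
resp.raise_for_status()
rows = resp.json()

# Distinct time windows that would need re-indexing (step 1)
intervals = sorted({(r["start"], r["end"]) for r in rows})
for start, end in intervals:
    print(f"{start}/{end}")
```

For steps 2/3, a rough sketch of pulling one partition/offset range back out of Kafka into a flat file, assuming the kafka-python client and one JSON record per message; the topic, partition, offsets, and broker addresses are placeholders you would take from the failed task's logs/metadata.

```python
# Rough sketch: dump a specific Kafka partition/offset range to a file so it
# can be re-ingested with a batch index task. Assumes the kafka-python client;
# topic, partition, offsets, and brokers are placeholders.
from kafka import KafkaConsumer, TopicPartition

BOOTSTRAP = ["kafka1:9092", "kafka2:9092"]       # corrected bootstrap list
TOPIC, PARTITION = "events", 3                   # assumed partition behind a bad segment
START_OFFSET, END_OFFSET = 1_000_000, 1_250_000  # assumed offset range

consumer = KafkaConsumer(
    bootstrap_servers=BOOTSTRAP,
    enable_auto_commit=False,
    consumer_timeout_ms=10_000,  # stop iterating if the partition runs dry
)
tp = TopicPartition(TOPIC, PARTITION)
consumer.assign([tp])
consumer.seek(tp, START_OFFSET)

with open(f"{TOPIC}-p{PARTITION}-{START_OFFSET}-{END_OFFSET}.json", "wb") as out:
    for msg in consumer:
        if msg.offset >= END_OFFSET:
            break
        out.write(msg.value + b"\n")  # assumes one JSON record per message

consumer.close()
```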

I don’t actually need to do any of this, as this is just test data, but I'm trying to understand whether this approach is sound, or whether there are better ways to recover the data that got missed in this scenario.

Daniel,

The cleanest way to recover is to do steps 1) and 3) as you outlined below. Check to see if you can also delete the 300 segments.
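If you do delete them first, a sketch along these lines could work, assuming the coordinator endpoint that marks a single segment unused (DELETE /druid/coordinator/v1/datasources/{dataSource}/segments/{segmentId}); check the docs for your Druid version, and note that the coordinator address and segment IDs below are placeholders.

```python
# Sketch: mark the unavailable segments unused via the Coordinator API before
# re-indexing their intervals. Endpoint behaviour varies by Druid version;
# coordinator address, datasource, and segment IDs are placeholders.
import requests

COORDINATOR = "http://localhost:8081"  # assumed coordinator address
DATASOURCE = "my_datasource"           # assumed datasource name

# e.g. the segment_id values returned by the sys.segments query above
unavailable_segment_ids = []

for segment_id in unavailable_segment_ids:
    url = (f"{COORDINATOR}/druid/coordinator/v1/datasources/"
           f"{DATASOURCE}/segments/{segment_id}")
    requests.delete(url).raise_for_status()
```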

Rommel Garcia