Help with Kafka index tasks in failed state

Hello,

I have an issue with Druid. I have a stream on Kafka and I subscribe to the topic through a supervisor.
The tasks are created and sometimes fail; however, even when they fail, the segments appear on the Historical node and queries resolve correctly, so the segments are being stored properly.
I’m using Druid 0.16.1-incubating; the last lines of the peon log, the peon status, and the peon report are attached.

Could you help me get the tasks to end up in SUCCESS status instead of FAILED and understand what happens to fix it?

Thanks!

peon_log.txt (15.8 KB)

report.json (237 Bytes)

status_peon.json (451 Bytes)

Hi Luis Gomez,

Please check the MiddleManager and Overlord logs for the failed task; they should give you more details.

Thanks and Regards,

Vaibhav

Hello,

Attached are the logs of the MiddleManager, the Overlord, and the peon task.

Could you help me see what’s going on and why the indexing task is failing?

Thank you!

overlod.txt (120 KB)

peon_task_log.txt (310 KB)

middle_manager.txt (5.07 KB)

Hi Luis,
For the attached peon task, the MiddleManager and Overlord logs do not have any details, as they seem incomplete. The peon task completed at 2020-01-14T13:47:24,929, but the MiddleManager and Overlord logs only go up to 2020-01-14 12:47.
However, I looked into one of the older Kafka indexing tasks for supervisor [KafkaSupervisor-rt-idbox], and I see the following error in the Overlord log:
Task-Id: index_kafka_rt-idbox_4ed329169eec831_hcbhclbm
2020-01-14 12:46:39.004,"2020-01-14T12:46:39,004 INFO [KafkaSupervisor-rt-idbox] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - {id='rt-idbox', generationTime=2020-01-14T12:46:39.004Z, payload=KafkaSupervisorReportPayload{dataSource='rt-idbox', topic='rt-idbox', partitions=1, replicas=2, durationSeconds=1800, active=[{id='index_kafka_rt-idbox_1257830c5cc7315_dclelgnc', startTime=2020-01-14T12:17:05.024Z, remainingSeconds=26}, {id='index_kafka_rt-idbox_1257830c5cc7315_dknffpaf', startTime=2020-01-14T12:17:07.505Z, remainingSeconds=28}], publishing=[{id='index_kafka_rt-idbox_4ed329169eec831_hcbhclbm', startTime=2020-01-14T11:46:56.488Z, remainingSeconds=18}, {id='index_kafka_rt-idbox_4ed329169eec831_adfklccf', startTime=2020-01-14T11:46:56.066Z, remainingSeconds=18}], suspended=false, healthy=true, state=RUNNING, detailedState=RUNNING, recentErrors=}}"

2020-01-14 12:47:06.116,"2020-01-14T12:47:06,116 ERROR [KafkaSupervisor-rt-idbox] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - No task in [[index_kafka_rt-idbox_4ed329169eec831_hcbhclbm, index_kafka_rt-idbox_4ed329169eec831_adfklccf]] for taskGroup [0] succeeded before the completion timeout elapsed [PT1800S]!: {class=org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor}"

2020-01-14 12:47:06.116,"2020-01-14T12:47:06,116 INFO [KafkaSupervisor-rt-idbox] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_rt-idbox_4ed329169eec831_hcbhclbm] because: [No task in pending completion taskGroup[0] succeeded before completion timeout elapsed]"

Kafka indexing tasks are expected to finish within the completion timeout. If they don’t, the Kafka supervisor assumes there is a problem and issues a kill/shutdown signal to the tasks; that is what appears to have happened here.
A running task will normally be in one of two states: reading or publishing. A task will remain in reading state for taskDuration, at which point it will transition to publishing state. A task will remain in publishing state for as long as it takes to generate segments, push segments to deep storage, and have them be loaded and served by a Historical process (or until completionTimeout elapses).
completionTimeout is the length of time to wait before declaring a publishing task as failed and terminating it. If it is set too low, your tasks may never publish. The publishing clock for a task begins roughly after taskDuration elapses.

For now, please increase completionTimeout to 60 minutes (i.e. PT60M) and see if that helps.
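As a rough sketch, completionTimeout lives in the ioConfig section of the Kafka supervisor spec. The snippet below only illustrates where the setting goes; the topic name and replica count are taken from your log output (topic=rt-idbox, replicas=2, durationSeconds=1800), and the other sections are placeholders for your existing spec:

```json
{
  "type": "kafka",
  "dataSchema": { "comment": "keep your existing dataSchema unchanged" },
  "tuningConfig": { "comment": "keep your existing tuningConfig unchanged" },
  "ioConfig": {
    "topic": "rt-idbox",
    "replicas": 2,
    "taskDuration": "PT30M",
    "completionTimeout": "PT60M"
  }
}
```

You can resubmit the updated spec to the Overlord (POST to /druid/indexer/v1/supervisor, or via the web console); running tasks will pick up the new setting at their next rollover.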

Additionally, I suggest going through the links below to fine-tune your Kafka ingestion:

https://druid.apache.org/docs/latest/development/extensions-core/kafka-ingestion.html
https://druid.apache.org/docs/latest/development/extensions-core/kafka-ingestion.html#capacity-planning

Thanks and Regards,
Vaibhav