Kafka Ingestion Peon Tasks Succeed but Overlord Shows Failure

  • Druid Version: 0.22.1
  • Kafka Ingestion (idempotent producer)

We recently started having intermittent problems with Kafka tasks failing. It seems irregular: the Peon task log shows “SUCCESS”, but the Overlord marks the task as “FAILED”.

From what I can see, the Peon task reports SUCCESS and then exits/shuts down; the Overlord then attempts to reach the Peon over HTTP, but since the Peon is already gone, those calls fail and the Overlord eventually marks the task as “FAILED”.

Peon Log:

2022-05-25T12:11:00,996 INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status
...
...
...
2022-05-25T12:11:01,199 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [INIT]
Finished peon task

Overlord Logs:

2022-05-25T12:11:13,393 WARN [IndexTaskClient-vrops-1] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://ligplbyjchbtcgg.amer.replaceddomain.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/pause]; will try again in [PT2S] (body/exception: [Connection refused (Connection refused)])
2022-05-25T12:11:15,396 WARN [IndexTaskClient-vrops-1] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://ligplbyjchbtcgg.amer.replaceddomain.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/pause]; will try again in [PT4S] (body/exception: [Connection refused (Connection refused)])
2022-05-25T12:11:16,042 INFO [IndexTaskClient-vrops-0] org.apache.druid.indexing.common.IndexTaskClient - submitRequest failed for [http://ligplbyjchbtcgg.amer.replaceddomain.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/offsets/current], with message [Connection refused (Connection refused)]
2022-05-25T12:11:19,398 WARN [IndexTaskClient-vrops-1] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://ligplbyjchbtcgg.amer.replaceddomain.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/pause]; will try again in [PT8S] (body/exception: [Connection refused (Connection refused)])
2022-05-25T12:11:27,404 WARN [IndexTaskClient-vrops-1] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://ligplbyjchbtcgg.amer.replaceddomain.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/pause]; will try again in [PT10S] (body/exception: [Connection refused (Connection refused)])
2022-05-25T12:11:37,407 WARN [IndexTaskClient-vrops-1] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://ligplbyjchbtcgg.amer.replaceddomain.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/pause]; will try again in [PT10S] (body/exception: [Connection refused (Connection refused)])
2022-05-25T12:11:40,377 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_vrops_a9f227794bcd31c_nohagige] because: [An exception occured while waiting for task [index_kafka_vrops_a9f227794bcd31c_nohagige] to pause: [org.apache.druid.java.util.common.IAE: Received 400 Bad Request with body: Can't pause, task is not in a pausable state (state: [PUBLISHING])]]
2022-05-25T12:11:40,382 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ligplbyjchbtcgg.amer.replaceddomain.com:8091, status 200 OK, response: {"task":"index_kafka_vrops_a9f227794bcd31c_nohagige"}
2022-05-25T12:11:40,382 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_vrops_a9f227794bcd31c_nohagige] from activeTasks
2022-05-25T12:11:40,382 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_vrops_a9f227794bcd31c_nohagige] from TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_vrops', dataSource='vrops', interval=2022-05-25T07:00:00.000Z/2022-05-25T08:00:00.000Z, version='2022-05-25T08:46:31.603Z', priority=75, revoked=false}]
2022-05-25T12:11:40,390 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_vrops_a9f227794bcd31c_nohagige] from TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_vrops', dataSource='vrops', interval=2022-05-25T08:00:00.000Z/2022-05-25T09:00:00.000Z, version='2022-05-25T09:40:09.046Z', priority=75, revoked=false}]
2022-05-25T12:11:40,401 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_vrops_a9f227794bcd31c_nohagige] from TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_vrops', dataSource='vrops', interval=2022-05-25T09:00:00.000Z/2022-05-25T10:00:00.000Z, version='2022-05-25T10:21:47.604Z', priority=75, revoked=false}]
2022-05-25T12:11:40,411 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_vrops_a9f227794bcd31c_nohagige] from TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_vrops', dataSource='vrops', interval=2022-05-25T10:00:00.000Z/2022-05-25T11:00:00.000Z, version='2022-05-25T11:16:12.450Z', priority=75, revoked=false}]
2022-05-25T12:11:40,416 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.MetadataTaskStorage - Updating task index_kafka_vrops_a9f227794bcd31c_nohagige to status: TaskStatus{id=index_kafka_vrops_a9f227794bcd31c_nohagige, status=FAILED, duration=-1, errorMsg=An exception occured while waiting for task [index_kafka_vrops_a9f227794bcd31c_nohagige] to pause: [...}
2022-05-25T12:11:40,420 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.TaskQueue - Task done: AbstractTask{id='index_kafka_vrops_a9f227794bcd31c_nohagige', groupId='index_kafka_vrops', taskResource=TaskResource{availabilityGroup='index_kafka_vrops_a9f227794bcd31c', requiredCapacity=1}, dataSource='vrops', context={checkpoints={"0":{"4":150881239982}}, useLineageBasedSegmentAllocation=true, IS_INCREMENTAL_HANDOFF_SUPPORTED=true, forceTimeChunkLock=true}}
2022-05-25T12:11:44,290 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_vrops_a9f227794bcd31c_nohagige] because: [task is not in knownTaskIds[[index_kafka_vrops_properties_ec5efd099c7bc54_dkniaoek, index_kafka_zabbix_04e7deda85b7b09_mejdnkkm, index_kafka_vrops_90fddd7dd00aec7_fjkjclgi, index_kafka_vrops_properties_0da27f4d850fe43_nhfdlkil, index_kafka_prom_lighthouse_080ca9811b1d2d4_aoiknogc, index_kafka_druid_metrics_8be3b0155a73ea0_fghicpfb, index_kafka_vrops_25dc2aa6be3a300_gohlcfap, index_kafka_vrops_6d765fcf8a23c1e_docphgjp, index_kafka_druid_metrics_a0629263d9acc3d_acdkbfmc, index_kafka_prom_lighthouse_9fd06cdb17ad979_ghaijclo, index_kafka_srm_a1139b196797600_fckbfele, index_kafka_prom_lighthouse_97f51e2a06fadd0_eddjgnfl, index_kafka_srm_bbfd58d981cd115_pnhjpgjp, index_kafka_zabbix_b980b711cff794a_flbfkppl, index_kafka_zabbix_63dd19779ab98c3_plfmield, index_kafka_ecscap_542f2ab1f65d0f9_fmfpedff, index_kafka_srm_32a67a2905a0ef9_doedneah, index_kafka_vrops_properties_0da27f4d850fe43_fmdheibd, index_kafka_vrops_properties_4437f8b14c8540c_apggadoc, index_kafka_srm_5eb82c47822b3d1_gfdgdohl, index_kafka_prom_lighthouse_a44e1099977e001_njccpjej, index_kafka_zabbix_ecc8882ac8a5e0b_anahjkgg, index_kafka_vrops_properties_4714686ef85e5f6_ajanocfp, index_kafka_ecscap_c79f891d46e541a_ppblajhf, index_kafka_vrops_properties_4714686ef85e5f6_ebbjffno]]]
2022-05-25T12:11:44,298 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ligplbyjchbtcgg.amer.replaceddomain.com:8091, status 200 OK, response: {"task":"index_kafka_vrops_a9f227794bcd31c_nohagige"}
2022-05-25T12:11:44,514 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_vrops_a9f227794bcd31c_nohagige] because: [task is not in knownTaskIds[[index_kafka_vrops_properties_ec5efd099c7bc54_dkniaoek, index_kafka_zabbix_04e7deda85b7b09_mejdnkkm, index_kafka_vrops_90fddd7dd00aec7_fjkjclgi, index_kafka_vrops_properties_0da27f4d850fe43_nhfdlkil, index_kafka_prom_lighthouse_94fbe74013e94c5_pmfeeebk, index_kafka_prom_lighthouse_080ca9811b1d2d4_aoiknogc, index_kafka_druid_metrics_8be3b0155a73ea0_fghicpfb, index_kafka_vrops_25dc2aa6be3a300_gohlcfap, index_kafka_vrops_6d765fcf8a23c1e_docphgjp, index_kafka_vrops_properties_4437f8b14c8540c_dkiifcef, index_kafka_druid_metrics_a0629263d9acc3d_acdkbfmc, index_kafka_prom_lighthouse_9fd06cdb17ad979_ghaijclo, index_kafka_srm_a1139b196797600_fckbfele, index_kafka_prom_lighthouse_97f51e2a06fadd0_eddjgnfl, index_kafka_srm_bbfd58d981cd115_pnhjpgjp, index_kafka_zabbix_b980b711cff794a_flbfkppl, index_kafka_zabbix_63dd19779ab98c3_plfmield, index_kafka_ecscap_542f2ab1f65d0f9_fmfpedff, index_kafka_srm_32a67a2905a0ef9_doedneah, index_kafka_vrops_properties_4437f8b14c8540c_apggadoc, index_kafka_srm_5eb82c47822b3d1_gfdgdohl, index_kafka_prom_lighthouse_a44e1099977e001_njccpjej, index_kafka_vrops_properties_6a0d97ed4d2e3e2_ccdemjjk, index_kafka_zabbix_ecc8882ac8a5e0b_anahjkgg, index_kafka_vrops_properties_4714686ef85e5f6_ajanocfp, index_kafka_ecscap_c79f891d46e541a_ppblajhf, index_kafka_vrops_properties_4714686ef85e5f6_ebbjffno]]]
2022-05-25T12:11:44,518 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ligplbyjchbtcgg.amer.replaceddomain.com:8091, status 200 OK, response: {"task":"index_kafka_vrops_a9f227794bcd31c_nohagige"}
2022-05-25T12:11:44,772 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_vrops_a9f227794bcd31c_nohagige] because: [task is not in knownTaskIds[[index_kafka_vrops_properties_ec5efd099c7bc54_dkniaoek, index_kafka_zabbix_04e7deda85b7b09_mejdnkkm, index_kafka_vrops_90fddd7dd00aec7_fjkjclgi, index_kafka_vrops_properties_0da27f4d850fe43_nhfdlkil, index_kafka_prom_lighthouse_94fbe74013e94c5_pmfeeebk, index_kafka_prom_lighthouse_080ca9811b1d2d4_aoiknogc, index_kafka_druid_metrics_8be3b0155a73ea0_fghicpfb, index_kafka_vrops_25dc2aa6be3a300_gohlcfap, index_kafka_vrops_6d765fcf8a23c1e_docphgjp, index_kafka_vrops_properties_4437f8b14c8540c_dkiifcef, index_kafka_druid_metrics_a0629263d9acc3d_acdkbfmc, index_kafka_prom_lighthouse_9fd06cdb17ad979_ghaijclo, index_kafka_srm_a1139b196797600_fckbfele, index_kafka_prom_lighthouse_97f51e2a06fadd0_eddjgnfl, index_kafka_srm_bbfd58d981cd115_pnhjpgjp, index_kafka_zabbix_b980b711cff794a_flbfkppl, index_kafka_zabbix_63dd19779ab98c3_plfmield, index_kafka_ecscap_542f2ab1f65d0f9_fmfpedff, index_kafka_srm_32a67a2905a0ef9_doedneah, index_kafka_srm_5eb82c47822b3d1_gfdgdohl, index_kafka_prom_lighthouse_a44e1099977e001_njccpjej, index_kafka_vrops_properties_76bf5c281336db3_lepnhmok, index_kafka_vrops_properties_6a0d97ed4d2e3e2_ccdemjjk, index_kafka_zabbix_ecc8882ac8a5e0b_anahjkgg, index_kafka_vrops_properties_4714686ef85e5f6_ajanocfp, index_kafka_ecscap_c79f891d46e541a_ppblajhf]]]
2022-05-25T12:11:44,775 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ligplbyjchbtcgg.amer.replaceddomain.com:8091, status 200 OK, response: {"task":"index_kafka_vrops_a9f227794bcd31c_nohagige"}
2022-05-25T12:11:45,058 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_vrops_a9f227794bcd31c_nohagige] because: [task is not in knownTaskIds[[index_kafka_zabbix_04e7deda85b7b09_mejdnkkm, index_kafka_vrops_90fddd7dd00aec7_fjkjclgi, index_kafka_vrops_properties_0da27f4d850fe43_nhfdlkil, index_kafka_prom_lighthouse_94fbe74013e94c5_pmfeeebk, index_kafka_prom_lighthouse_080ca9811b1d2d4_aoiknogc, index_kafka_druid_metrics_8be3b0155a73ea0_fghicpfb, index_kafka_vrops_25dc2aa6be3a300_gohlcfap, index_kafka_vrops_6d765fcf8a23c1e_docphgjp, index_kafka_vrops_properties_4437f8b14c8540c_dkiifcef, index_kafka_druid_metrics_a0629263d9acc3d_acdkbfmc, index_kafka_prom_lighthouse_9fd06cdb17ad979_ghaijclo, index_kafka_srm_a1139b196797600_fckbfele, index_kafka_prom_lighthouse_97f51e2a06fadd0_eddjgnfl, index_kafka_srm_bbfd58d981cd115_pnhjpgjp, index_kafka_zabbix_b980b711cff794a_flbfkppl, index_kafka_zabbix_63dd19779ab98c3_plfmield, index_kafka_srm_32a67a2905a0ef9_doedneah, index_kafka_srm_5eb82c47822b3d1_gfdgdohl, index_kafka_vrops_properties_ec5efd099c7bc54_hnhdnfkp, index_kafka_prom_lighthouse_a44e1099977e001_njccpjej, index_kafka_vrops_properties_76bf5c281336db3_lepnhmok, index_kafka_vrops_properties_6a0d97ed4d2e3e2_ccdemjjk, index_kafka_zabbix_ecc8882ac8a5e0b_anahjkgg, index_kafka_vrops_properties_4714686ef85e5f6_ajanocfp, index_kafka_ecscap_c79f891d46e541a_ppblajhf]]]
2022-05-25T12:11:45,067 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ligplbyjchbtcgg.amer.replaceddomain.com:8091, status 200 OK, response: {"task":"index_kafka_vrops_a9f227794bcd31c_nohagige"}
2022-05-25T12:11:45,374 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_vrops_a9f227794bcd31c_nohagige] because: [task is not in knownTaskIds[[index_kafka_zabbix_b980b711cff794a_flbfkppl, index_kafka_zabbix_04e7deda85b7b09_mejdnkkm, index_kafka_vrops_90fddd7dd00aec7_fjkjclgi, index_kafka_zabbix_63dd19779ab98c3_plfmield, index_kafka_vrops_properties_0da27f4d850fe43_nhfdlkil, index_kafka_prom_lighthouse_94fbe74013e94c5_pmfeeebk, index_kafka_prom_lighthouse_080ca9811b1d2d4_aoiknogc, index_kafka_druid_metrics_8be3b0155a73ea0_fghicpfb, index_kafka_vrops_25dc2aa6be3a300_gohlcfap, index_kafka_vrops_6d765fcf8a23c1e_docphgjp, index_kafka_vrops_properties_4437f8b14c8540c_dkiifcef, index_kafka_druid_metrics_a0629263d9acc3d_acdkbfmc, index_kafka_srm_5eb82c47822b3d1_gfdgdohl, index_kafka_vrops_properties_ec5efd099c7bc54_hnhdnfkp, index_kafka_prom_lighthouse_9fd06cdb17ad979_ghaijclo, index_kafka_prom_lighthouse_a44e1099977e001_njccpjej, index_kafka_srm_a1139b196797600_fckbfele, index_kafka_vrops_properties_76bf5c281336db3_lepnhmok, index_kafka_vrops_properties_6a0d97ed4d2e3e2_ccdemjjk, index_kafka_zabbix_ecc8882ac8a5e0b_anahjkgg, index_kafka_vrops_properties_4714686ef85e5f6_ajanocfp, index_kafka_prom_lighthouse_97f51e2a06fadd0_eddjgnfl, index_kafka_ecscap_c79f891d46e541a_ppblajhf]]]
2022-05-25T12:11:45,379 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ligplbyjchbtcgg.amer.replaceddomain.com:8091, status 200 OK, response: {"task":"index_kafka_vrops_a9f227794bcd31c_nohagige"}
2022-05-25T12:11:45,634 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_vrops_a9f227794bcd31c_nohagige] because: [task is not in knownTaskIds[[index_kafka_zabbix_b980b711cff794a_flbfkppl, index_kafka_zabbix_04e7deda85b7b09_mejdnkkm, index_kafka_vrops_90fddd7dd00aec7_fjkjclgi, index_kafka_zabbix_63dd19779ab98c3_plfmield, index_kafka_vrops_properties_0da27f4d850fe43_nhfdlkil, index_kafka_prom_lighthouse_94fbe74013e94c5_pmfeeebk, index_kafka_prom_lighthouse_080ca9811b1d2d4_aoiknogc, index_kafka_vrops_25dc2aa6be3a300_gohlcfap, index_kafka_vrops_6d765fcf8a23c1e_docphgjp, index_kafka_vrops_properties_4437f8b14c8540c_dkiifcef, index_kafka_druid_metrics_a0629263d9acc3d_acdkbfmc, index_kafka_srm_5eb82c47822b3d1_gfdgdohl, index_kafka_vrops_properties_ec5efd099c7bc54_hnhdnfkp, index_kafka_prom_lighthouse_9fd06cdb17ad979_ghaijclo, index_kafka_prom_lighthouse_a44e1099977e001_njccpjej, index_kafka_srm_a1139b196797600_fckbfele, index_kafka_vrops_properties_76bf5c281336db3_lepnhmok, index_kafka_vrops_properties_6a0d97ed4d2e3e2_ccdemjjk, index_kafka_zabbix_ecc8882ac8a5e0b_anahjkgg, index_kafka_vrops_properties_4714686ef85e5f6_ajanocfp, index_kafka_prom_lighthouse_97f51e2a06fadd0_eddjgnfl, index_kafka_ecscap_c79f891d46e541a_ppblajhf]]]
2022-05-25T12:11:45,638 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ligplbyjchbtcgg.amer.replaceddomain.com:8091, status 200 OK, response: {"task":"index_kafka_vrops_a9f227794bcd31c_nohagige"}
2022-05-25T12:11:45,687 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Worker[ligplbyjchbtcgg.amer.replaceddomain.com:8091] wrote SUCCESS status for task [index_kafka_vrops_a9f227794bcd31c_nohagige] on [TaskLocation{host='ligplbyjchbtcgg.amer.replaceddomain.com', port=8101, tlsPort=-1}]
2022-05-25T12:11:45,687 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Worker[ligplbyjchbtcgg.amer.replaceddomain.com:8091] completed task[index_kafka_vrops_a9f227794bcd31c_nohagige] with status[SUCCESS]
2022-05-25T12:11:45,687 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.TaskQueue - Received SUCCESS status for task: index_kafka_vrops_a9f227794bcd31c_nohagige
2022-05-25T12:11:45,687 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_vrops_a9f227794bcd31c_nohagige] because: [notified status change from task]
2022-05-25T12:11:45,687 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Cleaning up task[index_kafka_vrops_a9f227794bcd31c_nohagige] on worker[ligplbyjchbtcgg.amer.replaceddomain.com:8091]
2022-05-25T12:11:45,690 WARN [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.TaskQueue - Unknown task completed: index_kafka_vrops_a9f227794bcd31c_nohagige
2022-05-25T12:11:45,690 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.TaskQueue - Task SUCCESS: AbstractTask{id='index_kafka_vrops_a9f227794bcd31c_nohagige', groupId='index_kafka_vrops', taskResource=TaskResource{availabilityGroup='index_kafka_vrops_a9f227794bcd31c', requiredCapacity=1}, dataSource='vrops', context={checkpoints={"0":{"4":150881239982}}, useLineageBasedSegmentAllocation=true, IS_INCREMENTAL_HANDOFF_SUPPORTED=true, forceTimeChunkLock=true}} (4086059 run duration)
2022-05-25T12:11:45,690 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Task[index_kafka_vrops_a9f227794bcd31c_nohagige] went bye bye.

Based on the logs, it appears the Peon’s SUCCESS status was not recognized (or polled) by the Overlord in time; I’m unsure how the SUCCESS status gets passed from a Peon to the Overlord. The Overlord does eventually see the SUCCESS, but only after it has tried to shut the task down multiple times, and by then it has already marked the task as FAILED.

Can someone help me understand how SUCCESS statuses are reported from Peons to the Overlord, why this is occurring, and how we can resolve it? We currently have taskDuration: 1 hour and completionTimeout: 30 min. We did try playing around with completionTimeout in the past, but I believe these issues persisted.
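For reference, the relevant part of our supervisor spec looks roughly like this (the two timeout values are the ones mentioned above; the rest of the spec is trimmed out):

"ioConfig": {
  "type": "kafka",
  "taskDuration": "PT1H",
  "completionTimeout": "PT30M",
  ...
}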

Thanks,
Peter

Hi Peter,

Are you running the Overlord in remote mode?

My initial guess is that you have to adjust properties in the overlord/runtime.properties file, possibly druid.indexer.runner.taskAssignmentTimeout.
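For example, something along these lines in overlord/runtime.properties (PT10M is just a value to experiment with, not a recommendation for every cluster):

# How long to wait after a task has been assigned to a MiddleManager before throwing an error (default PT5M)
druid.indexer.runner.taskAssignmentTimeout=PT10M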

Best,

Mark

Hi Mark,

Appreciate the response and yes we are running in remote mode.

That property says: “How long to wait after a task has been assigned to a MiddleManager before throwing an error”

Could you explain a little why you believe this would help? I’m just trying to understand the thought process and make sure I’m thinking along the same lines. The Overlord logs show:

2022-05-25T12:03:03,794 INFO [IndexTaskClient-vrops-3] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Task [index_kafka_vrops_a9f227794bcd31c_nohagige] paused successfully
2022-05-25T12:11:13,393 WARN [IndexTaskClient-vrops-1] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://ligplbyjchbtcgg.amer.dell.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/pause]; will try again in [PT2S] (body/exception: [Connection refused (Connection refused)])
2022-05-25T12:11:15,396 WARN [IndexTaskClient-vrops-1] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://ligplbyjchbtcgg.amer.dell.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/pause]; will try again in [PT4S] (body/exception: [Connection refused (Connection refused)])
2022-05-25T12:11:16,042 INFO [IndexTaskClient-vrops-0] org.apache.druid.indexing.common.IndexTaskClient - submitRequest failed for [http://ligplbyjchbtcgg.amer.dell.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/offsets/current], with message [Connection refused (Connection refused)]
2022-05-25T12:11:19,398 WARN [IndexTaskClient-vrops-1] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://ligplbyjchbtcgg.amer.dell.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/pause]; will try again in [PT8S] (body/exception: [Connection refused (Connection refused)])
2022-05-25T12:11:27,404 WARN [IndexTaskClient-vrops-1] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://ligplbyjchbtcgg.amer.dell.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/pause]; will try again in [PT10S] (body/exception: [Connection refused (Connection refused)])
2022-05-25T12:11:37,407 WARN [IndexTaskClient-vrops-1] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://ligplbyjchbtcgg.amer.dell.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/pause]; will try again in [PT10S] (body/exception: [Connection refused (Connection refused)])
2022-05-25T12:11:40,377 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_vrops_a9f227794bcd31c_nohagige] because: [An exception occured while waiting for task [index_kafka_vrops_a9f227794bcd31c_nohagige] to pause: [org.apache.druid.java.util.common.IAE: Received 400 Bad Request with body: Can't pause, task is not in a pausable state (state: [PUBLISHING])]]
2022-05-25T12:11:40,382 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ligplbyjchbtcgg.amer.dell.com:8091, status 200 OK, response: {"task":"index_kafka_vrops_a9f227794bcd31c_nohagige"}
2022-05-25T12:11:40,382 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_vrops_a9f227794bcd31c_nohagige] from activeTasks
2022-05-25T12:11:40,382 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_vrops_a9f227794bcd31c_nohagige] from TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_vrops', dataSource='vrops', interval=2022-05-25T07:00:00.000Z/2022-05-25T08:00:00.000Z, version='2022-05-25T08:46:31.603Z', priority=75, revoked=false}]
2022-05-25T12:11:40,390 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_vrops_a9f227794bcd31c_nohagige] from TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_vrops', dataSource='vrops', interval=2022-05-25T08:00:00.000Z/2022-05-25T09:00:00.000Z, version='2022-05-25T09:40:09.046Z', priority=75, revoked=false}]
2022-05-25T12:11:40,401 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_vrops_a9f227794bcd31c_nohagige] from TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_vrops', dataSource='vrops', interval=2022-05-25T09:00:00.000Z/2022-05-25T10:00:00.000Z, version='2022-05-25T10:21:47.604Z', priority=75, revoked=false}]
2022-05-25T12:11:40,411 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_vrops_a9f227794bcd31c_nohagige] from TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_vrops', dataSource='vrops', interval=2022-05-25T10:00:00.000Z/2022-05-25T11:00:00.000Z, version='2022-05-25T11:16:12.450Z', priority=75, revoked=false}]
2022-05-25T12:11:40,416 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.MetadataTaskStorage - Updating task index_kafka_vrops_a9f227794bcd31c_nohagige to status: TaskStatus{id=index_kafka_vrops_a9f227794bcd31c_nohagige, status=FAILED, duration=-1, errorMsg=An exception occured while waiting for task [index_kafka_vrops_a9f227794bcd31c_nohagige] to pause: [...}

So if I understood your logic correctly, you are thinking the Overlord marked the task as FAILED because some action sent to the MiddleManager was not executed within the default PT5M? And since the task had already finished with SUCCESS, the Overlord would have picked up the SUCCESS status instead of failing the task if it had waited roughly one more minute (according to the logs)?

That also makes me wonder: how exactly does the Overlord get the SUCCESS status from a Peon? ZooKeeper, the metadata/SQL store, something else? Does the Overlord poll, or does the Peon push? Could we also try changing that frequency, if it is configurable?

Thanks

We are experiencing something similar: on the MiddleManager the task finishes with SUCCESS status, but it is later tagged as FAILED in both the MiddleManager and Coordinator logs. Are there any further findings on this case?

@Jvalant_Patel,

We are still experiencing this issue (it comes and goes for us). I tried adjusting the setting @Mark_Herrera suggested, as well as druid.indexer.runner.taskShutdownLinkTimeout, without luck. I also tried increasing druid.global.http.numConnections and druid.global.http.numMaxThreads on the Overlord/Coordinator and MiddleManagers, without success (I was checking whether communications were failing because the services were busy). We also occasionally see errors like:

Bad response HTTP [no response] from [http://REDACTED:8103/druid/worker/v1/chat/index_kafka_redacted_27b6f47201f8230_ggkmecbl/status]; will try again in [PT4S] (body/exception: [Connection refused (Connection refused)])

But based on the logs, the task was (usually) already stopped, which is why the connection refused errors happen; the Overlord just doesn’t realize the task’s status yet. Hence my question about how exactly the Overlord receives the status from the Peon (it appears to be HTTP) and whether there is any way to debug that process better. I’m trying to figure out whether there are configurable options for that step, such as the number of retries, that might help us.
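For what it’s worth, the final SUCCESS lines in our Overlord log come from a Curator-PathChildrenCache thread, so at least part of the status flow may go through ZooKeeper rather than HTTP. This is roughly how I’ve been poking at those znodes while debugging (the paths are my assumption based on the default druid.zk.paths.base of /druid; adjust for your setup):

# List status znodes announced by a given worker (path is an assumption based on defaults)
./zkCli.sh -server <zk-host>:2181 ls /druid/indexer/status/ligplbyjchbtcgg.amer.replaceddomain.com:8091

# Dump the status payload for the task in question
./zkCli.sh -server <zk-host>:2181 get /druid/indexer/status/ligplbyjchbtcgg.amer.replaceddomain.com:8091/index_kafka_vrops_a9f227794bcd31c_nohagige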

I’m unsure what to try next. At times it recovers on its own, and sometimes after restarting all components, but recovery is not consistent.

For further troubleshooting info/background, we are running the following set up:

  • 1 Router
  • 2 Brokers
  • 2 Coordinators
  • 2 Overlords
    • druid_maxdirectmemorysize: 10500m
    • druid_xms: 15g
    • druid_xmx: 15g
    • druid.indexer.runner.type: remote
    • druid.indexer.storage.type: metadata
    • druid.global.http.numConnections: 100 (probably way too high, but I was trying to eliminate HTTP issues as a potential cause; we usually set this to 50-60)
    • druid.global.http.numMaxThreads: 100 (probably way too high, but I was trying to eliminate HTTP issues as a potential cause; we usually leave this at the default, which is 49 on a 16-core node according to the formula)
    • druid.indexer.runner.taskAssignmentTimeout: PT10M (adjusted from default PT5M based on Mark’s recommendation)
    • druid.indexer.runner.taskShutdownLinkTimeout: PT3M (adjusted from default PT1M based on Mark’s recommendation)
  • 10 Historicals with 1.8 TB storage
  • 14 MM with 10 peon slots (we don’t use more than 2-3 on each usually, just for future scaling)
    • 4GB heap per peon

Historicals and MMs are on 16 CPU / 128 GB VMs. Resource usage is relatively low (CPU and memory usually below 50%), and looking at the metrics (we export them using Prometheus), JVM usage is also within bounds.

I’m still looking into this issue, but it’s hard to tell. What I did notice is that, since we have a decent amount of data (20 TB+), when we run all supervisor tasks at the same time the Overlord appears to run in a degraded state: it takes minutes to do things that usually take seconds and runs everything in a delayed manner.

For example, the Overlord picking up a new task to run is normally fast:

2022-06-21T18:08:49,254 INFO [KafkaSupervisor-REDACT] org.apache.druid.indexing.overlord.MetadataTaskStorage - Inserting task index_kafka_REDACTED_fbcf653eb7f3e8f_afmjmfem with status: TaskStatus{id=index_kafka_REDACTED_fbcf653eb7f3e8f_afmjmfem, status=RUNNING, duration=-1, errorMsg=null}
2022-06-21T18:08:49,259 INFO [KafkaSupervisor-REDACT] org.apache.druid.indexing.overlord.TaskLockbox - Adding task[index_kafka_REDACTED_fbcf653eb7f3e8f_afmjmfem] to activeTasks
2022-06-21T18:08:49,259 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.TaskQueue - Asking taskRunner to run: index_kafka_REDACTED_fbcf653eb7f3e8f_afmjmfem

But when we are having issues, it can take minutes; for example, here there is a 3-minute delay between task insertion and asking the runner to run it:

2022-06-21T18:08:49,254 INFO [KafkaSupervisor-REDACT] org.apache.druid.indexing.overlord.MetadataTaskStorage - Inserting task index_kafka_REDACTED_fbcf653eb7f3e8f_afmjmfem with status: TaskStatus{id=index_kafka_REDACTED_fbcf653eb7f3e8f_afmjmfem, status=RUNNING, duration=-1, errorMsg=null}
2022-06-21T18:11:49,259 INFO [KafkaSupervisor-REDACT] org.apache.druid.indexing.overlord.TaskLockbox - Adding task[index_kafka_REDACTED_fbcf653eb7f3e8f_afmjmfem] to activeTasks
2022-06-21T18:11:49,259 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.TaskQueue - Asking taskRunner to run: index_kafka_REDACTED_fbcf653eb7f3e8f_afmjmfem

I’m not sure the above is the best example of Overlord slowness, since other services are involved, but for that specific example we have metrics emitted to our Prometheus server, and SQL query response times from the other Druid services look fine (usually in the millisecond range) during the times we hit this issue.

It might be a configuration issue, but the Overlords are on nodes with enough resources and do not consume everything they are allocated. Based on this behavior, and on the general slowness I see when we run all tasks at the same time, the Overlord appears to be the bottleneck for our tasks running to completion (we don’t actually have that many tasks compared to bigger setups, I’m sure: 26 concurrent tasks under normal load, plus 3-6 compaction tasks that may run).

What I found helps somewhat (for us we are back to almost 100%, with only an occasional failure or two) is offsetting the supervisor task start times. For example, start data source #1 first, then start data source #2 ~15-30 minutes later, and so on. My thinking is that during heavy Overlord usage (task completion and READING/PUBLISHING/etc. status changes), we can avoid having all tasks hit the same time windows and instead offset the windows where they rely heavily on the Overlord.
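We currently do the offsetting by hand, but a scripted version would look something like this (the spec filenames and Overlord host are placeholders; the endpoint is the standard supervisor API):

# Submit/refresh each supervisor spec with a gap between them so their task
# windows don't all line up on the Overlord at the same time
for spec in vrops.json vrops_properties.json srm.json zabbix.json; do
  curl -s -X POST -H 'Content-Type: application/json' \
    -d @"$spec" http://OVERLORD_HOST:8090/druid/indexer/v1/supervisor
  sleep 1800   # ~30 minutes between data sources
done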

It’s not ideal but I’m still looking into the issue when I have time and seeing if the community here has any recommendations/thoughts.

Updated previous post with more configuration for visibility/troubleshooting

We are still facing this issue, even when we start supervisor tasks in an offset manner as I mentioned before. We run hourly Kafka ingestion tasks, and after the first 1-2 hours things start running in a degraded state, at least between the Overlord and MM/Peon (and possibly the metadata DB, since those actions seem to take a while, though I’m not sure yet whether the DB itself is degraded or the Druid services are just slow to complete DB operations).
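One thing I’ve started doing to try to rule the metadata DB in or out is querying the task table directly (the table name assumes the default druid.metadata.storage.tables.base of druid):

-- Active tasks and when they were created, to spot a backlog building up
SELECT id, datasource, created_date, active
FROM druid_tasks
WHERE active = true
ORDER BY created_date;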

Things actually hold up fine if we don’t enable all supervisors and run just a few, which points to a load issue on one or more of those three services (Overlord, MM, Peon, and I guess possibly the DB). Once we turn on all the tasks, things start to fail and many different types of errors show up, such as:

Kafka empty offsets, which is weird because we are always sending data, and tasks run fine again if we lower the number of supervisors running:

"errorMsg": "Task [index_kafka_REDACT_4c6fc601148dfa7_gbaamibh] returned empty offsets after pause"

Even an NPE, which I haven’t seen before:

2022-06-22T03:23:51,896 INFO [main] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - Stopping forcefully (status: [STARTING])
2022-06-22T03:23:51,896 ERROR [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Exception when stopping method[public void org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner.stop()] on object[org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner@24954e82]
java.lang.reflect.InvocationTargetException: null
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_322]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_322]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_322]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_322]
    at org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.stop(Lifecycle.java:465) [druid-core-0.22.1.jar:0.22.1]
    at org.apache.druid.java.util.common.lifecycle.Lifecycle.stop(Lifecycle.java:368) [druid-core-0.22.1.jar:0.22.1]
    at org.apache.druid.cli.CliPeon.run(CliPeon.java:323) [druid-services-0.22.1.jar:0.22.1]
    at org.apache.druid.cli.Main.main(Main.java:113) [druid-services-0.22.1.jar:0.22.1]
Caused by: java.lang.NullPointerException
    at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.stopForcefully(SeekableStreamIndexTaskRunner.java:1383) ~[druid-indexing-service-0.22.1.jar:0.22.1]
    at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTask.stopGracefully(SeekableStreamIndexTask.java:161) ~[druid-indexing-service-0.22.1.jar:0.22.1]
    at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner.stop(SingleTaskBackgroundRunner.java:186) ~[druid-indexing-service-0.22.1.jar:0.22.1]
    ... 8 more

We also see task assignment timeouts (even after increasing the timeout to PT10M), which again points to one or more of those three services not doing their job in a timely manner and running in a degraded state. We often see tasks sit in “WAITING” status for 3-5+ minutes before they either move to SUCCESS/FAILED or start their actual work (“RUNNING”):

2022-06-22T02:11:04,016 ERROR [rtr-pending-tasks-runner-0] org.apache.druid.indexing.overlord.RemoteTaskRunner - Task assignment timed out on worker [REDACT.com:8091], never ran task [index_kafka_REDACT_01225427fd2835b_bkaddgea]! Timeout: (660669 >= PT10M)!: 

I know other Druid users probably have much more volume than we do (we have ~21 TB of total data available in deep storage/Historicals and process ~400,000,000 metrics an hour), so this is strange: we wouldn’t expect to overload the Druid processes with our current compute and setup. And as discussed, the general resource usage of the Druid services doesn’t exceed the thresholds that are set.

I’m unsure what to try next at the moment, but I’m thinking a debug log-level investigation would be a good next step.
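If it’s useful to anyone, the rough plan is to bump the indexing packages to DEBUG in the Overlord’s log4j2.xml and compare a good hour against a bad one, something like this (assuming the stock Console appender from the default config):

<Loggers>
  <!-- Extra detail for task runner / supervisor interactions only -->
  <Logger name="org.apache.druid.indexing" level="debug" additivity="false">
    <AppenderRef ref="Console"/>
  </Logger>
  <Root level="info">
    <AppenderRef ref="Console"/>
  </Root>
</Loggers>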