Kafka Ingestion: Peon Tasks Succeed but Overlord Shows Failure

  • Druid Version: 0.22.1
  • Kafka Ingestion (idempotent producer)

We recently started having intermittent problems with Kafka tasks failing, and it seems irregular: the Peon task logs show “SUCCESS”, but the Overlord marks the task as “FAILED”.

From what I can see, the Peon task reports SUCCESS and then exits/shuts down; the Overlord then attempts to reach the Peon via HTTP, but since the Peon process is already gone, those calls fail and the Overlord eventually marks the task as “FAILED”.

Peon Log:

2022-05-25T12:11:00,996 INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status
...
...
...
2022-05-25T12:11:01,199 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [INIT]
Finished peon task

Overlord Logs:

2022-05-25T12:11:13,393 WARN [IndexTaskClient-vrops-1] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://ligplbyjchbtcgg.amer.replaceddomain.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/pause]; will try again in [PT2S] (body/exception: [Connection refused (Connection refused)])
2022-05-25T12:11:15,396 WARN [IndexTaskClient-vrops-1] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://ligplbyjchbtcgg.amer.replaceddomain.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/pause]; will try again in [PT4S] (body/exception: [Connection refused (Connection refused)])
2022-05-25T12:11:16,042 INFO [IndexTaskClient-vrops-0] org.apache.druid.indexing.common.IndexTaskClient - submitRequest failed for [http://ligplbyjchbtcgg.amer.replaceddomain.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/offsets/current], with message [Connection refused (Connection refused)]
2022-05-25T12:11:19,398 WARN [IndexTaskClient-vrops-1] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://ligplbyjchbtcgg.amer.replaceddomain.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/pause]; will try again in [PT8S] (body/exception: [Connection refused (Connection refused)])
2022-05-25T12:11:27,404 WARN [IndexTaskClient-vrops-1] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://ligplbyjchbtcgg.amer.replaceddomain.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/pause]; will try again in [PT10S] (body/exception: [Connection refused (Connection refused)])
2022-05-25T12:11:37,407 WARN [IndexTaskClient-vrops-1] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://ligplbyjchbtcgg.amer.replaceddomain.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/pause]; will try again in [PT10S] (body/exception: [Connection refused (Connection refused)])
2022-05-25T12:11:40,377 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_vrops_a9f227794bcd31c_nohagige] because: [An exception occured while waiting for task [index_kafka_vrops_a9f227794bcd31c_nohagige] to pause: [org.apache.druid.java.util.common.IAE: Received 400 Bad Request with body: Can't pause, task is not in a pausable state (state: [PUBLISHING])]]
2022-05-25T12:11:40,382 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ligplbyjchbtcgg.amer.replaceddomain.com:8091, status 200 OK, response: {"task":"index_kafka_vrops_a9f227794bcd31c_nohagige"}
2022-05-25T12:11:40,382 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_vrops_a9f227794bcd31c_nohagige] from activeTasks
2022-05-25T12:11:40,382 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_vrops_a9f227794bcd31c_nohagige] from TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_vrops', dataSource='vrops', interval=2022-05-25T07:00:00.000Z/2022-05-25T08:00:00.000Z, version='2022-05-25T08:46:31.603Z', priority=75, revoked=false}]
2022-05-25T12:11:40,390 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_vrops_a9f227794bcd31c_nohagige] from TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_vrops', dataSource='vrops', interval=2022-05-25T08:00:00.000Z/2022-05-25T09:00:00.000Z, version='2022-05-25T09:40:09.046Z', priority=75, revoked=false}]
2022-05-25T12:11:40,401 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_vrops_a9f227794bcd31c_nohagige] from TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_vrops', dataSource='vrops', interval=2022-05-25T09:00:00.000Z/2022-05-25T10:00:00.000Z, version='2022-05-25T10:21:47.604Z', priority=75, revoked=false}]
2022-05-25T12:11:40,411 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_vrops_a9f227794bcd31c_nohagige] from TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_vrops', dataSource='vrops', interval=2022-05-25T10:00:00.000Z/2022-05-25T11:00:00.000Z, version='2022-05-25T11:16:12.450Z', priority=75, revoked=false}]
2022-05-25T12:11:40,416 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.MetadataTaskStorage - Updating task index_kafka_vrops_a9f227794bcd31c_nohagige to status: TaskStatus{id=index_kafka_vrops_a9f227794bcd31c_nohagige, status=FAILED, duration=-1, errorMsg=An exception occured while waiting for task [index_kafka_vrops_a9f227794bcd31c_nohagige] to pause: [...}
2022-05-25T12:11:40,420 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.TaskQueue - Task done: AbstractTask{id='index_kafka_vrops_a9f227794bcd31c_nohagige', groupId='index_kafka_vrops', taskResource=TaskResource{availabilityGroup='index_kafka_vrops_a9f227794bcd31c', requiredCapacity=1}, dataSource='vrops', context={checkpoints={"0":{"4":150881239982}}, useLineageBasedSegmentAllocation=true, IS_INCREMENTAL_HANDOFF_SUPPORTED=true, forceTimeChunkLock=true}}
2022-05-25T12:11:44,290 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_vrops_a9f227794bcd31c_nohagige] because: [task is not in knownTaskIds[[index_kafka_vrops_properties_ec5efd099c7bc54_dkniaoek, index_kafka_zabbix_04e7deda85b7b09_mejdnkkm, index_kafka_vrops_90fddd7dd00aec7_fjkjclgi, index_kafka_vrops_properties_0da27f4d850fe43_nhfdlkil, index_kafka_prom_lighthouse_080ca9811b1d2d4_aoiknogc, index_kafka_druid_metrics_8be3b0155a73ea0_fghicpfb, index_kafka_vrops_25dc2aa6be3a300_gohlcfap, index_kafka_vrops_6d765fcf8a23c1e_docphgjp, index_kafka_druid_metrics_a0629263d9acc3d_acdkbfmc, index_kafka_prom_lighthouse_9fd06cdb17ad979_ghaijclo, index_kafka_srm_a1139b196797600_fckbfele, index_kafka_prom_lighthouse_97f51e2a06fadd0_eddjgnfl, index_kafka_srm_bbfd58d981cd115_pnhjpgjp, index_kafka_zabbix_b980b711cff794a_flbfkppl, index_kafka_zabbix_63dd19779ab98c3_plfmield, index_kafka_ecscap_542f2ab1f65d0f9_fmfpedff, index_kafka_srm_32a67a2905a0ef9_doedneah, index_kafka_vrops_properties_0da27f4d850fe43_fmdheibd, index_kafka_vrops_properties_4437f8b14c8540c_apggadoc, index_kafka_srm_5eb82c47822b3d1_gfdgdohl, index_kafka_prom_lighthouse_a44e1099977e001_njccpjej, index_kafka_zabbix_ecc8882ac8a5e0b_anahjkgg, index_kafka_vrops_properties_4714686ef85e5f6_ajanocfp, index_kafka_ecscap_c79f891d46e541a_ppblajhf, index_kafka_vrops_properties_4714686ef85e5f6_ebbjffno]]]
2022-05-25T12:11:44,298 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ligplbyjchbtcgg.amer.replaceddomain.com:8091, status 200 OK, response: {"task":"index_kafka_vrops_a9f227794bcd31c_nohagige"}
2022-05-25T12:11:44,514 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_vrops_a9f227794bcd31c_nohagige] because: [task is not in knownTaskIds[[index_kafka_vrops_properties_ec5efd099c7bc54_dkniaoek, index_kafka_zabbix_04e7deda85b7b09_mejdnkkm, index_kafka_vrops_90fddd7dd00aec7_fjkjclgi, index_kafka_vrops_properties_0da27f4d850fe43_nhfdlkil, index_kafka_prom_lighthouse_94fbe74013e94c5_pmfeeebk, index_kafka_prom_lighthouse_080ca9811b1d2d4_aoiknogc, index_kafka_druid_metrics_8be3b0155a73ea0_fghicpfb, index_kafka_vrops_25dc2aa6be3a300_gohlcfap, index_kafka_vrops_6d765fcf8a23c1e_docphgjp, index_kafka_vrops_properties_4437f8b14c8540c_dkiifcef, index_kafka_druid_metrics_a0629263d9acc3d_acdkbfmc, index_kafka_prom_lighthouse_9fd06cdb17ad979_ghaijclo, index_kafka_srm_a1139b196797600_fckbfele, index_kafka_prom_lighthouse_97f51e2a06fadd0_eddjgnfl, index_kafka_srm_bbfd58d981cd115_pnhjpgjp, index_kafka_zabbix_b980b711cff794a_flbfkppl, index_kafka_zabbix_63dd19779ab98c3_plfmield, index_kafka_ecscap_542f2ab1f65d0f9_fmfpedff, index_kafka_srm_32a67a2905a0ef9_doedneah, index_kafka_vrops_properties_4437f8b14c8540c_apggadoc, index_kafka_srm_5eb82c47822b3d1_gfdgdohl, index_kafka_prom_lighthouse_a44e1099977e001_njccpjej, index_kafka_vrops_properties_6a0d97ed4d2e3e2_ccdemjjk, index_kafka_zabbix_ecc8882ac8a5e0b_anahjkgg, index_kafka_vrops_properties_4714686ef85e5f6_ajanocfp, index_kafka_ecscap_c79f891d46e541a_ppblajhf, index_kafka_vrops_properties_4714686ef85e5f6_ebbjffno]]]
2022-05-25T12:11:44,518 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ligplbyjchbtcgg.amer.replaceddomain.com:8091, status 200 OK, response: {"task":"index_kafka_vrops_a9f227794bcd31c_nohagige"}
2022-05-25T12:11:44,772 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_vrops_a9f227794bcd31c_nohagige] because: [task is not in knownTaskIds[[index_kafka_vrops_properties_ec5efd099c7bc54_dkniaoek, index_kafka_zabbix_04e7deda85b7b09_mejdnkkm, index_kafka_vrops_90fddd7dd00aec7_fjkjclgi, index_kafka_vrops_properties_0da27f4d850fe43_nhfdlkil, index_kafka_prom_lighthouse_94fbe74013e94c5_pmfeeebk, index_kafka_prom_lighthouse_080ca9811b1d2d4_aoiknogc, index_kafka_druid_metrics_8be3b0155a73ea0_fghicpfb, index_kafka_vrops_25dc2aa6be3a300_gohlcfap, index_kafka_vrops_6d765fcf8a23c1e_docphgjp, index_kafka_vrops_properties_4437f8b14c8540c_dkiifcef, index_kafka_druid_metrics_a0629263d9acc3d_acdkbfmc, index_kafka_prom_lighthouse_9fd06cdb17ad979_ghaijclo, index_kafka_srm_a1139b196797600_fckbfele, index_kafka_prom_lighthouse_97f51e2a06fadd0_eddjgnfl, index_kafka_srm_bbfd58d981cd115_pnhjpgjp, index_kafka_zabbix_b980b711cff794a_flbfkppl, index_kafka_zabbix_63dd19779ab98c3_plfmield, index_kafka_ecscap_542f2ab1f65d0f9_fmfpedff, index_kafka_srm_32a67a2905a0ef9_doedneah, index_kafka_srm_5eb82c47822b3d1_gfdgdohl, index_kafka_prom_lighthouse_a44e1099977e001_njccpjej, index_kafka_vrops_properties_76bf5c281336db3_lepnhmok, index_kafka_vrops_properties_6a0d97ed4d2e3e2_ccdemjjk, index_kafka_zabbix_ecc8882ac8a5e0b_anahjkgg, index_kafka_vrops_properties_4714686ef85e5f6_ajanocfp, index_kafka_ecscap_c79f891d46e541a_ppblajhf]]]
2022-05-25T12:11:44,775 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ligplbyjchbtcgg.amer.replaceddomain.com:8091, status 200 OK, response: {"task":"index_kafka_vrops_a9f227794bcd31c_nohagige"}
2022-05-25T12:11:45,058 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_vrops_a9f227794bcd31c_nohagige] because: [task is not in knownTaskIds[[index_kafka_zabbix_04e7deda85b7b09_mejdnkkm, index_kafka_vrops_90fddd7dd00aec7_fjkjclgi, index_kafka_vrops_properties_0da27f4d850fe43_nhfdlkil, index_kafka_prom_lighthouse_94fbe74013e94c5_pmfeeebk, index_kafka_prom_lighthouse_080ca9811b1d2d4_aoiknogc, index_kafka_druid_metrics_8be3b0155a73ea0_fghicpfb, index_kafka_vrops_25dc2aa6be3a300_gohlcfap, index_kafka_vrops_6d765fcf8a23c1e_docphgjp, index_kafka_vrops_properties_4437f8b14c8540c_dkiifcef, index_kafka_druid_metrics_a0629263d9acc3d_acdkbfmc, index_kafka_prom_lighthouse_9fd06cdb17ad979_ghaijclo, index_kafka_srm_a1139b196797600_fckbfele, index_kafka_prom_lighthouse_97f51e2a06fadd0_eddjgnfl, index_kafka_srm_bbfd58d981cd115_pnhjpgjp, index_kafka_zabbix_b980b711cff794a_flbfkppl, index_kafka_zabbix_63dd19779ab98c3_plfmield, index_kafka_srm_32a67a2905a0ef9_doedneah, index_kafka_srm_5eb82c47822b3d1_gfdgdohl, index_kafka_vrops_properties_ec5efd099c7bc54_hnhdnfkp, index_kafka_prom_lighthouse_a44e1099977e001_njccpjej, index_kafka_vrops_properties_76bf5c281336db3_lepnhmok, index_kafka_vrops_properties_6a0d97ed4d2e3e2_ccdemjjk, index_kafka_zabbix_ecc8882ac8a5e0b_anahjkgg, index_kafka_vrops_properties_4714686ef85e5f6_ajanocfp, index_kafka_ecscap_c79f891d46e541a_ppblajhf]]]
2022-05-25T12:11:45,067 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ligplbyjchbtcgg.amer.replaceddomain.com:8091, status 200 OK, response: {"task":"index_kafka_vrops_a9f227794bcd31c_nohagige"}
2022-05-25T12:11:45,374 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_vrops_a9f227794bcd31c_nohagige] because: [task is not in knownTaskIds[[index_kafka_zabbix_b980b711cff794a_flbfkppl, index_kafka_zabbix_04e7deda85b7b09_mejdnkkm, index_kafka_vrops_90fddd7dd00aec7_fjkjclgi, index_kafka_zabbix_63dd19779ab98c3_plfmield, index_kafka_vrops_properties_0da27f4d850fe43_nhfdlkil, index_kafka_prom_lighthouse_94fbe74013e94c5_pmfeeebk, index_kafka_prom_lighthouse_080ca9811b1d2d4_aoiknogc, index_kafka_druid_metrics_8be3b0155a73ea0_fghicpfb, index_kafka_vrops_25dc2aa6be3a300_gohlcfap, index_kafka_vrops_6d765fcf8a23c1e_docphgjp, index_kafka_vrops_properties_4437f8b14c8540c_dkiifcef, index_kafka_druid_metrics_a0629263d9acc3d_acdkbfmc, index_kafka_srm_5eb82c47822b3d1_gfdgdohl, index_kafka_vrops_properties_ec5efd099c7bc54_hnhdnfkp, index_kafka_prom_lighthouse_9fd06cdb17ad979_ghaijclo, index_kafka_prom_lighthouse_a44e1099977e001_njccpjej, index_kafka_srm_a1139b196797600_fckbfele, index_kafka_vrops_properties_76bf5c281336db3_lepnhmok, index_kafka_vrops_properties_6a0d97ed4d2e3e2_ccdemjjk, index_kafka_zabbix_ecc8882ac8a5e0b_anahjkgg, index_kafka_vrops_properties_4714686ef85e5f6_ajanocfp, index_kafka_prom_lighthouse_97f51e2a06fadd0_eddjgnfl, index_kafka_ecscap_c79f891d46e541a_ppblajhf]]]
2022-05-25T12:11:45,379 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ligplbyjchbtcgg.amer.replaceddomain.com:8091, status 200 OK, response: {"task":"index_kafka_vrops_a9f227794bcd31c_nohagige"}
2022-05-25T12:11:45,634 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_vrops_a9f227794bcd31c_nohagige] because: [task is not in knownTaskIds[[index_kafka_zabbix_b980b711cff794a_flbfkppl, index_kafka_zabbix_04e7deda85b7b09_mejdnkkm, index_kafka_vrops_90fddd7dd00aec7_fjkjclgi, index_kafka_zabbix_63dd19779ab98c3_plfmield, index_kafka_vrops_properties_0da27f4d850fe43_nhfdlkil, index_kafka_prom_lighthouse_94fbe74013e94c5_pmfeeebk, index_kafka_prom_lighthouse_080ca9811b1d2d4_aoiknogc, index_kafka_vrops_25dc2aa6be3a300_gohlcfap, index_kafka_vrops_6d765fcf8a23c1e_docphgjp, index_kafka_vrops_properties_4437f8b14c8540c_dkiifcef, index_kafka_druid_metrics_a0629263d9acc3d_acdkbfmc, index_kafka_srm_5eb82c47822b3d1_gfdgdohl, index_kafka_vrops_properties_ec5efd099c7bc54_hnhdnfkp, index_kafka_prom_lighthouse_9fd06cdb17ad979_ghaijclo, index_kafka_prom_lighthouse_a44e1099977e001_njccpjej, index_kafka_srm_a1139b196797600_fckbfele, index_kafka_vrops_properties_76bf5c281336db3_lepnhmok, index_kafka_vrops_properties_6a0d97ed4d2e3e2_ccdemjjk, index_kafka_zabbix_ecc8882ac8a5e0b_anahjkgg, index_kafka_vrops_properties_4714686ef85e5f6_ajanocfp, index_kafka_prom_lighthouse_97f51e2a06fadd0_eddjgnfl, index_kafka_ecscap_c79f891d46e541a_ppblajhf]]]
2022-05-25T12:11:45,638 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ligplbyjchbtcgg.amer.replaceddomain.com:8091, status 200 OK, response: {"task":"index_kafka_vrops_a9f227794bcd31c_nohagige"}
2022-05-25T12:11:45,687 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Worker[ligplbyjchbtcgg.amer.replaceddomain.com:8091] wrote SUCCESS status for task [index_kafka_vrops_a9f227794bcd31c_nohagige] on [TaskLocation{host='ligplbyjchbtcgg.amer.replaceddomain.com', port=8101, tlsPort=-1}]
2022-05-25T12:11:45,687 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Worker[ligplbyjchbtcgg.amer.replaceddomain.com:8091] completed task[index_kafka_vrops_a9f227794bcd31c_nohagige] with status[SUCCESS]
2022-05-25T12:11:45,687 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.TaskQueue - Received SUCCESS status for task: index_kafka_vrops_a9f227794bcd31c_nohagige
2022-05-25T12:11:45,687 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_vrops_a9f227794bcd31c_nohagige] because: [notified status change from task]
2022-05-25T12:11:45,687 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Cleaning up task[index_kafka_vrops_a9f227794bcd31c_nohagige] on worker[ligplbyjchbtcgg.amer.replaceddomain.com:8091]
2022-05-25T12:11:45,690 WARN [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.TaskQueue - Unknown task completed: index_kafka_vrops_a9f227794bcd31c_nohagige
2022-05-25T12:11:45,690 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.TaskQueue - Task SUCCESS: AbstractTask{id='index_kafka_vrops_a9f227794bcd31c_nohagige', groupId='index_kafka_vrops', taskResource=TaskResource{availabilityGroup='index_kafka_vrops_a9f227794bcd31c', requiredCapacity=1}, dataSource='vrops', context={checkpoints={"0":{"4":150881239982}}, useLineageBasedSegmentAllocation=true, IS_INCREMENTAL_HANDOFF_SUPPORTED=true, forceTimeChunkLock=true}} (4086059 run duration)
2022-05-25T12:11:45,690 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Task[index_kafka_vrops_a9f227794bcd31c_nohagige] went bye bye.

Based on the logs, the Peon’s SUCCESS status was not recognized by the Overlord in time (or not polled in time; I’m unsure how the SUCCESS status gets passed from a Peon to the Overlord). Condensed timeline (all times 12:11): at :01 the Peon exits, from :13 to :37 the Overlord’s pause calls get connection refused, at :40 it writes FAILED to the metadata store, and only at :45 does the SUCCESS notification arrive. So the Overlord does eventually see the SUCCESS status, but only after it has tried to shut the task down multiple times, by which point it has already marked the task as a failure.

Can someone help me understand how SUCCESS statuses get from Peons to Overlords? And also why this is occurring and how we can resolve it? We currently have taskDuration: 1 hour and completionTimeout: 30 minutes (a sketch of the relevant ioConfig is below). We did try playing with completionTimeout in the past, but I believe these issues persisted.
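For reference, here is a minimal sketch of where those two settings live in our supervisor spec’s ioConfig (the topic, consumerProperties, taskCount, and replicas below are placeholders, not our real values):

"ioConfig": {
  "type": "kafka",
  "topic": "vrops",
  "consumerProperties": { "bootstrap.servers": "kafka-broker:9092" },
  "taskCount": 1,
  "replicas": 1,
  "taskDuration": "PT1H",
  "completionTimeout": "PT30M"
}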

Thanks,
Peter

Hi Peter,

Are you running the Overlord in remote mode?

My initial guess is that you have to adjust properties in the overlord/runtime.properties file, possibly druid.indexer.runner.taskAssignmentTimeout.
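For example, in the Overlord’s runtime.properties (the default is PT5M; PT10M here is just an illustrative value to try):

druid.indexer.runner.taskAssignmentTimeout=PT10M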

Best,

Mark

Hi Mark,

Appreciate the response, and yes, we are running in remote mode.

That property says: “How long to wait after a task has been assigned to a MiddleManager before throwing an error”

Could you explain a bit why you believe this would help? I’m just trying to understand the thought process and make sure I’m thinking along the same lines. The Overlord logs show:

2022-05-25T12:03:03,794 INFO [IndexTaskClient-vrops-3] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Task [index_kafka_vrops_a9f227794bcd31c_nohagige] paused successfully
2022-05-25T12:11:13,393 WARN [IndexTaskClient-vrops-1] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://ligplbyjchbtcgg.amer.replaceddomain.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/pause]; will try again in [PT2S] (body/exception: [Connection refused (Connection refused)])
2022-05-25T12:11:15,396 WARN [IndexTaskClient-vrops-1] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://ligplbyjchbtcgg.amer.replaceddomain.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/pause]; will try again in [PT4S] (body/exception: [Connection refused (Connection refused)])
2022-05-25T12:11:16,042 INFO [IndexTaskClient-vrops-0] org.apache.druid.indexing.common.IndexTaskClient - submitRequest failed for [http://ligplbyjchbtcgg.amer.replaceddomain.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/offsets/current], with message [Connection refused (Connection refused)]
2022-05-25T12:11:19,398 WARN [IndexTaskClient-vrops-1] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://ligplbyjchbtcgg.amer.replaceddomain.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/pause]; will try again in [PT8S] (body/exception: [Connection refused (Connection refused)])
2022-05-25T12:11:27,404 WARN [IndexTaskClient-vrops-1] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://ligplbyjchbtcgg.amer.replaceddomain.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/pause]; will try again in [PT10S] (body/exception: [Connection refused (Connection refused)])
2022-05-25T12:11:37,407 WARN [IndexTaskClient-vrops-1] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://ligplbyjchbtcgg.amer.replaceddomain.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/pause]; will try again in [PT10S] (body/exception: [Connection refused (Connection refused)])
2022-05-25T12:11:40,377 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_vrops_a9f227794bcd31c_nohagige] because: [An exception occured while waiting for task [index_kafka_vrops_a9f227794bcd31c_nohagige] to pause: [org.apache.druid.java.util.common.IAE: Received 400 Bad Request with body: Can't pause, task is not in a pausable state (state: [PUBLISHING])]]
2022-05-25T12:11:40,382 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ligplbyjchbtcgg.amer.replaceddomain.com:8091, status 200 OK, response: {"task":"index_kafka_vrops_a9f227794bcd31c_nohagige"}
2022-05-25T12:11:40,382 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_vrops_a9f227794bcd31c_nohagige] from activeTasks
2022-05-25T12:11:40,382 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_vrops_a9f227794bcd31c_nohagige] from TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_vrops', dataSource='vrops', interval=2022-05-25T07:00:00.000Z/2022-05-25T08:00:00.000Z, version='2022-05-25T08:46:31.603Z', priority=75, revoked=false}]
2022-05-25T12:11:40,390 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_vrops_a9f227794bcd31c_nohagige] from TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_vrops', dataSource='vrops', interval=2022-05-25T08:00:00.000Z/2022-05-25T09:00:00.000Z, version='2022-05-25T09:40:09.046Z', priority=75, revoked=false}]
2022-05-25T12:11:40,401 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_vrops_a9f227794bcd31c_nohagige] from TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_vrops', dataSource='vrops', interval=2022-05-25T09:00:00.000Z/2022-05-25T10:00:00.000Z, version='2022-05-25T10:21:47.604Z', priority=75, revoked=false}]
2022-05-25T12:11:40,411 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_vrops_a9f227794bcd31c_nohagige] from TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_vrops', dataSource='vrops', interval=2022-05-25T10:00:00.000Z/2022-05-25T11:00:00.000Z, version='2022-05-25T11:16:12.450Z', priority=75, revoked=false}]
2022-05-25T12:11:40,416 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.MetadataTaskStorage - Updating task index_kafka_vrops_a9f227794bcd31c_nohagige to status: TaskStatus{id=index_kafka_vrops_a9f227794bcd31c_nohagige, status=FAILED, duration=-1, errorMsg=An exception occured while waiting for task [index_kafka_vrops_a9f227794bcd31c_nohagige] to pause: [...}

So if I understood your logic correctly, you are thinking that the Overlord marked the task as FAILED because some action sent to the MM was not executed within the default PT5M? And in this case, since the task had already finished with SUCCESS, you’re thinking the Overlord would have picked up the SUCCESS status instead of failing the task if it had waited a little longer (seconds, according to the logs)?

That also makes me wonder: how exactly does the Overlord get the SUCCESS status from a Peon? ZooKeeper, the metadata/SQL store, etc.? Is the Overlord polling, or is the Peon pushing? Couldn’t we also try changing that frequency if it’s configurable? (The Curator-PathChildrenCache thread in the Overlord logs above suggests the status change arrives via a ZooKeeper watch.)

Thanks

We are experiencing something similar: on the MiddleManager, the task finishes with SUCCESS status but later gets tagged as failed in the MiddleManager as well as in the Coordinator logs. Are there any further findings on this case?

@Jvalant_Patel,

We are still experiencing this issue (it comes and goes for us). I tried adjusting the setting that @Mark_Herrera gave and also the druid.indexer.runner.taskShutdownLinkTimeout setting, without luck. I also tried increasing druid.global.http.numConnections and druid.global.http.numMaxThreads for the Overlord/Coordinator and MMs without success (I was checking whether communications were failing due to busy services). We also occasionally see errors like:

Bad response HTTP [no response] from [http://REDACTED:8103/druid/worker/v1/chat/index_kafka_redacted_27b6f47201f8230_ggkmecbl/status]; will try again in [PT4S] (body/exception: [Connection refused (Connection refused)])

But based on the logs, the task was (usually) already stopped, which is why the connection refused happens; the Overlord just doesn’t know the task’s status yet. Hence my question about how exactly the Overlord receives the status from the Peon (it appears to be HTTP) and whether there is any way to debug that process better. I’m trying to figure out if there are configurable options for that step that might help us, like the number of retries, etc. The endpoints involved can at least be exercised manually, as sketched below.
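For anyone else digging into this, these are the Peon “chat” endpoints the Overlord keeps calling in our logs; they can be probed manually with any HTTP client while the task is still alive (host, port, and task id are placeholders):

GET  http://<peon-host>:<peon-port>/druid/worker/v1/chat/<task-id>/status
GET  http://<peon-host>:<peon-port>/druid/worker/v1/chat/<task-id>/offsets/current
POST http://<peon-host>:<peon-port>/druid/worker/v1/chat/<task-id>/pause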

I’m unsure what to try next. At times it recovers on its own, or sometimes after restarts of all components, but recovery isn’t consistent.

For further troubleshooting info/background, we are running the following set up:

  • 1 Router
  • 2 Brokers
  • 2 Coordinators
  • 2 Overlords
    • druid_maxdirectmemorysize: 10500m
    • druid_xms: 15g
    • druid_xmx: 15g
    • druid.indexer.runner.type: remote
    • druid.indexer.storage.type: metadata
    • druid.global.http.numConnections: 100 (probably way too high, but I was trying to eliminate HTTP issues as a potential cause; usually set to 50-60)
    • druid.global.http.numMaxThreads: 100 (probably way too high, but same reasoning; usually left at the default, which is 49 on a 16-core node according to the formula)
    • druid.indexer.runner.taskAssignmentTimeout: PT10M (adjusted from default PT5M based on Mark’s recommendation)
    • druid.indexer.runner.taskShutdownLinkTimeout: PT3M (adjusted from default PT1M based on Mark’s recommendation)
  • 10 Historicals with 1.8 TB storage
  • 14 MMs with 10 peon slots each (we usually don’t use more than 2-3 on each; the rest is for future scaling)
    • 4GB heap per peon

Historicals and MMs are on 16 CPU / 128 GB VMs. Resource usage is relatively low (CPU and memory usually below 50%), and looking at metrics (we export them via Prometheus), JVM usage is also within bounds.

I’m still looking into this issue, but it’s hard to tell. What I did notice is that, since we have a decent amount of data (20 TB+), when we run all supervisors’ tasks at the same time the Overlord appears to run in a degraded state. It takes minutes to do things that usually take seconds, and everything runs in a delayed manner.

For example, the Overlord requesting a new task to run is normally fast:

2022-06-21T18:08:49,254 INFO [KafkaSupervisor-REDACT] org.apache.druid.indexing.overlord.MetadataTaskStorage - Inserting task index_kafka_REDACTED_fbcf653eb7f3e8f_afmjmfem with status: TaskStatus{id=index_kafka_REDACTED_fbcf653eb7f3e8f_afmjmfem, status=RUNNING, duration=-1, errorMsg=null}
2022-06-21T18:08:49,259 INFO [KafkaSupervisor-REDACT] org.apache.druid.indexing.overlord.TaskLockbox - Adding task[index_kafka_REDACTED_fbcf653eb7f3e8f_afmjmfem] to activeTasks
2022-06-21T18:08:49,259 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.TaskQueue - Asking taskRunner to run: index_kafka_REDACTED_fbcf653eb7f3e8f_afmjmfem

But during times when we have issues, it can take minutes; for example, a 3-minute delay between task insertion and running:

2022-06-21T18:08:49,254 INFO [KafkaSupervisor-REDACT] org.apache.druid.indexing.overlord.MetadataTaskStorage - Inserting task index_kafka_REDACTED_fbcf653eb7f3e8f_afmjmfem with status: TaskStatus{id=index_kafka_REDACTED_fbcf653eb7f3e8f_afmjmfem, status=RUNNING, duration=-1, errorMsg=null}
2022-06-21T18:11:49,259 INFO [KafkaSupervisor-REDACT] org.apache.druid.indexing.overlord.TaskLockbox - Adding task[index_kafka_REDACTED_fbcf653eb7f3e8f_afmjmfem] to activeTasks
2022-06-21T18:11:49,259 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.TaskQueue - Asking taskRunner to run: index_kafka_REDACTED_fbcf653eb7f3e8f_afmjmfem

I’m not sure the above is the best example of Overlord slowness, since other services are involved, but for this specific example we have metrics emitted to our Prometheus server, and SQL query response times from the other Druid services look fine (usually in the ms range) during the times we have this issue.

It might be a configuration issue, but these Overlords are on nodes with enough resources, and they do not consume everything they are allocated. Based on this behavior, the Overlord appears to be the bottleneck keeping our tasks from running to completion, given the general slowness I see from it when we run all tasks at the same time (we don’t actually run that many tasks compared to bigger setups, I’m sure: 26 concurrent tasks under normal load, plus 3-6 more for compaction when it runs).

What I found can somewhat help (we are back to almost 100%, with an occasional failure or two) is offsetting your supervisor start times. For example, start data source #1 first, then start data source #2 ~15-30 minutes later, and so on (see the sketch after this paragraph). My thought process is that during heavy Overlord usage (task completion and READING/PUBLISHING/etc. status changes), we avoid having all tasks hit those same time windows and instead stagger the windows where they rely on the Overlord heavily.
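A minimal sketch of the staggering, assuming the Overlord’s supervisor suspend/resume API (the Overlord host and supervisor ids are placeholders; 8090 is the default plaintext Overlord port):

POST http://<overlord-host>:8090/druid/indexer/v1/supervisor/<datasource-1>/resume
# wait ~15-30 minutes
POST http://<overlord-host>:8090/druid/indexer/v1/supervisor/<datasource-2>/resume
# ...repeat for each remaining supervisor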

It’s not ideal, but I’m still looking into the issue when I have time, and seeing if the community here has any recommendations/thoughts.

Updated previous post with more configuration for visibility/troubleshooting

We are still facing this issue, even when we start supervisor tasks in an offset manner as I mentioned before. We run hourly Kafka ingestion tasks, and after the first 1-2 hours of tasks, things start to run in a degraded state, at least between the Overlord and MM/Peons (and possibly the metadata DB, since those actions seem to take a while; I’m unsure at the moment whether the DB itself is degraded or the Druid services are just taking a while to complete DB operations).

Things actually hold up fine if we don’t enable all supervisors and run just a few, which points to a load issue on one or more of those services (Overlord, MM, Peon, and possibly the DB). Once we turn on all tasks, things start to fail and a lot of different types of errors show up, like:

Empty Kafka offsets, which is weird because we are always sending data, and tasks run again if we lower the number of supervisors running:

"errorMsg": "Task [index_kafka_REDACT_4c6fc601148dfa7_gbaamibh] returned empty offsets after pause"

Even an NPE, which I haven’t seen before:

2022-06-22T03:23:51,896 INFO [main] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - Stopping forcefully (status: [STARTING])
2022-06-22T03:23:51,896 ERROR [main] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Exception when stopping method[public void org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner.stop()] on object[org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner@24954e82]
java.lang.reflect.InvocationTargetException: null
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_322]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_322]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_322]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_322]
    at org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.stop(Lifecycle.java:465) [druid-core-0.22.1.jar:0.22.1]
    at org.apache.druid.java.util.common.lifecycle.Lifecycle.stop(Lifecycle.java:368) [druid-core-0.22.1.jar:0.22.1]
    at org.apache.druid.cli.CliPeon.run(CliPeon.java:323) [druid-services-0.22.1.jar:0.22.1]
    at org.apache.druid.cli.Main.main(Main.java:113) [druid-services-0.22.1.jar:0.22.1]
Caused by: java.lang.NullPointerException
    at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.stopForcefully(SeekableStreamIndexTaskRunner.java:1383) ~[druid-indexing-service-0.22.1.jar:0.22.1]
    at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTask.stopGracefully(SeekableStreamIndexTask.java:161) ~[druid-indexing-service-0.22.1.jar:0.22.1]
    at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner.stop(SingleTaskBackgroundRunner.java:186) ~[druid-indexing-service-0.22.1.jar:0.22.1]
    ... 8 more

And also task assignment timeouts (even after increasing the timeout to PT10M), which indicates one or more of those services is not doing its job in a timely manner and is running in a degraded state. We often see tasks sit in “WAITING” status for 3-5+ minutes before they either move to SUCCESS/FAILED or start their actual work (“RUNNING”):

2022-06-22T02:11:04,016 ERROR [rtr-pending-tasks-runner-0] org.apache.druid.indexing.overlord.RemoteTaskRunner - Task assignment timed out on worker [REDACT.com:8091], never ran task [index_kafka_REDACT_01225427fd2835b_bkaddgea]! Timeout: (660669 >= PT10M)!: 

I know other Druid users probably have much more volume than we do (we have ~21 TB of total data available in deep storage/Historicals and process ~400,000,000 metrics an hour, i.e. roughly 110K events/second), so this is odd: we wouldn’t expect to overload the Druid processes with our current compute and setup. And as discussed, general resource use of the Druid services doesn’t exceed the thresholds that are set.

I’m unsure what to try next at the moment, but a debug log-level investigation seems like a good next step.

For those still here: I appreciate you hanging around; hopefully we can get something useful out of this discussion. I am going to split this into multiple posts due to character limits.

I was finally able to run a debug session on the Overlord and Peon services and get more details on what seems to be the issue. I believe I have found something, but I’m unsure yet what the proper way to proceed is; I’ll provide the details below. Some overall details useful for reading these logs:

  • Kafka Ingestion Tasks
  • taskDuration: 1 hour
  • completionTimeout: 45 minutes
  • For this task (which I’m using as an example/case study) start time is noted as: 2022-07-06T20:56:26.370Z
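(As a sanity check on the timing: the 2022-07-06T20:56:26Z start plus the PT1H taskDuration puts the expected pause/stop right around 21:56, which lines up with the pause traffic in the snippets below.)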

#1

Based on my findings in my cluster (your cluster may be different but could be seeing something similar), this is a problem that has been around in Druid for a while, or at least appears to date back to version “druid-12.2-rc3”.

Similar issues opened by other users without resolution or real engagement from the community (there are probably more; these are based on my key search terms):

  1. org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Triggering JVM shutdown · Issue #11015 · apache/druid · GitHub
  2. When I ingest, It'll reset if it's over 10 million rows. · Issue #10607 · apache/druid · GitHub
  3. Task killed by Overlord because it is not responding to Pause · Issue #7378 · apache/druid · GitHub

In our specific case, the issue stems from the Supervisor/Overlord asking the running Kafka task to “pause” (which does seem to happen frequently). Usually these “pause” requests (an IPC call over HTTP) go through fine, but the Peon can respond with “202 Accepted” instead of the usual “200 OK” or “400 Bad Request”. It’s not yet clear to me why a 202 was issued and the task never actually paused. In our case, after the 202 the task kept reporting the “STARTING” phase on “/status” calls, and it seemed to remain there until it was killed: by the time the next “/pause” request came, it had switched to the “PUBLISHING” phase, which according to the code/logs is an unpausable state and throws an exception.

Some Relevant info below:

  /**
   * Signals the ingestion loop to pause.
   *
   * @return one of the following Responses: 400 Bad Request if the task has started publishing; 202 Accepted if the
   * method has timed out and returned before the task has paused; 200 OK with a map of the current partition sequences
   * in the response body if the task successfully paused
   */
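To make those three outcomes concrete, this is the shape of the exchange as we observe it (host and task id are placeholders; the response bodies are taken from our logs and the javadoc above):

POST /druid/worker/v1/chat/<task-id>/pause
  -> 200 OK           body: map of current partition sequences (task paused)
  -> 202 Accepted     body: "Request accepted but task has not yet paused"
  -> 400 Bad Request  body: "Can't pause, task is not in a pausable state (state: [PUBLISHING])"

GET /druid/worker/v1/chat/<task-id>/status
  -> 200 OK           body: one of NOT_STARTED / STARTING / READING / PAUSED / PUBLISHING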
  • In my task example, this log snippet is where the task was asked to “pause” and responded with 202. I also noticed there is a “pause” request right before this one: that initial pause request came ~50 minutes after task start and was successful (200 OK); the task was then resumed and showed the “STARTING” phase:
...
2022-07-06T21:56:35,617 DEBUG [qtp323665272-156] org.eclipse.jetty.server.HttpOutput - write(array) s=OPEN,api=BLOCKING,sc=false,e=null aggregated !flush HeapByteBuffer@5760b9c0[p=0,l=44,c=32768,r=44]={<<<Request accepted but task has not yet paused>>>2-07-06T2...\x00\x00\x00\x00\x00\x00\x00}
2022-07-06T21:56:35,617 DEBUG [qtp323665272-156] org.eclipse.jetty.server.handler.gzip.GzipHttpOutputInterceptor - org.eclipse.jetty.server.handler.gzip.GzipHttpOutputInterceptor@299973cb compressing java.util.zip.Deflater@52483be5
2022-07-06T21:56:35,617 DEBUG [qtp323665272-156] org.eclipse.jetty.server.HttpChannel - sendResponse info=null content=HeapByteBuffer@716022[p=0,l=10,c=32768,r=10]={<<<\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x00>>>\x00o\xA1\xD0\r\x8b\x06\xC7\x85...\x00\x00\x00\x00\x00\x00\x00} complete=false committing=true callback=GzipBufferCB@e12c8b4[content=HeapByteBuffer@5760b9c0[p=44,l=44,c=32768,r=0]={Request a... paused<<<>>>2-07-06T2...\x00\x00\x00\x00\x00\x00\x00} last=false copy=null buffer=HeapByteBuffer@716022[p=0,l=10,c=32768,r=10]={<<<\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x00>>>\x00o\xA1\xD0\r\x8b\x06\xC7\x85...\x00\x00\x00\x00\x00\x00\x00} deflate=java.util.zip.Deflater@52483be5 ]
2022-07-06T21:56:35,617 DEBUG [qtp323665272-156] org.eclipse.jetty.server.HttpChannel - COMMIT for /druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/pause on HttpChannelOverHttp@427bc76a{s=HttpChannelState@2d006490{s=HANDLING rs=BLOCKING os=COMMITTED is=IDLE awp=false se=false i=true al=0},r=1,c=false/false,a=HANDLING,uri=//REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/pause,age=2002}
202 Accepted HTTP/1.1
Date: Wed, 06 Jul 2022 21:56:33 GMT
X-Druid-Task-Id: index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc
Content-Type: application/json
Vary: Accept-Encoding, User-Agent
Content-Encoding: gzip
...

#2 Continued

After the 202 Accepted, the Supervisor/Overlord repeatedly hits the Peon’s “/status” endpoint to see whether it actually paused. After the 202, it appears to respond only with the “STARTING” status:

2022-07-06T21:56:33,609 DEBUG [IndexTaskClient-REDACT-5] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Pause task[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]
2022-07-06T21:56:33,612 DEBUG [IndexTaskClient-REDACT-5] org.apache.druid.indexing.common.IndexTaskClient - HTTP POST: http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/pause
2022-07-06T21:56:33,612 DEBUG [IndexTaskClient-REDACT-5] org.apache.druid.java.util.http.client.NettyHttpClient - [POST http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/pause] starting
2022-07-06T21:56:35,618 DEBUG [HttpClient-Netty-Worker-11] org.apache.druid.java.util.http.client.NettyHttpClient - [POST http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/pause] messageReceived: DefaultHttpResponse(chunked: true)
X-Druid-Task-Id: index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc
2022-07-06T21:56:35,618 DEBUG [HttpClient-Netty-Worker-11] org.apache.druid.java.util.http.client.NettyHttpClient - [POST http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/pause] Got response: 202 Accepted
2022-07-06T21:56:35,618 DEBUG [HttpClient-Netty-Worker-11] org.apache.druid.java.util.http.client.NettyHttpClient - [POST http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/pause] messageReceived: org.jboss.netty.handler.codec.http.DefaultHttpChunk@6634e90e
2022-07-06T21:56:35,618 DEBUG [HttpClient-Netty-Worker-11] org.apache.druid.java.util.http.client.NettyHttpClient - [POST http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/pause] Got chunk: 44B, last=false
2022-07-06T21:56:35,618 DEBUG [HttpClient-Netty-Worker-11] org.apache.druid.java.util.http.client.NettyHttpClient - [POST http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/pause] messageReceived: org.jboss.netty.handler.codec.http.HttpChunk$1@47788630
2022-07-06T21:56:35,618 DEBUG [HttpClient-Netty-Worker-11] org.apache.druid.java.util.http.client.NettyHttpClient - [POST http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/pause] Got chunk: 0B, last=true
2022-07-06T21:56:35,618 DEBUG [IndexTaskClient-REDACT-5] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - GetStatus task[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]
2022-07-06T21:56:35,623 DEBUG [IndexTaskClient-REDACT-5] org.apache.druid.indexing.common.IndexTaskClient - HTTP GET: http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/status
2022-07-06T21:56:35,623 DEBUG [IndexTaskClient-REDACT-5] org.apache.druid.java.util.http.client.NettyHttpClient - [GET http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/status] starting
2022-07-06T21:56:35,628 DEBUG [HttpClient-Netty-Worker-16] org.apache.druid.java.util.http.client.NettyHttpClient - [GET http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/status] messageReceived: DefaultHttpResponse(chunked: true)
X-Druid-Task-Id: index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc
2022-07-06T21:56:35,628 DEBUG [HttpClient-Netty-Worker-16] org.apache.druid.java.util.http.client.NettyHttpClient - [GET http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/status] Got response: 200 OK
2022-07-06T21:56:35,628 DEBUG [HttpClient-Netty-Worker-16] org.apache.druid.java.util.http.client.NettyHttpClient - [GET http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/status] messageReceived: org.jboss.netty.handler.codec.http.DefaultHttpChunk@4922d15b
2022-07-06T21:56:35,628 DEBUG [HttpClient-Netty-Worker-16] org.apache.druid.java.util.http.client.NettyHttpClient - [GET http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/status] Got chunk: 9B, last=false
2022-07-06T21:56:35,628 DEBUG [HttpClient-Netty-Worker-16] org.apache.druid.java.util.http.client.NettyHttpClient - [GET http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/status] messageReceived: org.jboss.netty.handler.codec.http.HttpChunk$1@47788630
2022-07-06T21:56:35,628 DEBUG [HttpClient-Netty-Worker-16] org.apache.druid.java.util.http.client.NettyHttpClient - [GET http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/status] Got chunk: 0B, last=true
2022-07-06T21:56:35,628 INFO [IndexTaskClient-REDACT-5] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc] to change its status to [PAUSED]; will try again in [PT2S]
2022-07-06T21:56:37,628 DEBUG [IndexTaskClient-REDACT-5] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - GetStatus task[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]
2022-07-06T21:56:37,632 DEBUG [IndexTaskClient-REDACT-5] org.apache.druid.indexing.common.IndexTaskClient - HTTP GET: http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/status
2022-07-06T21:56:37,632 DEBUG [IndexTaskClient-REDACT-5] org.apache.druid.java.util.http.client.NettyHttpClient - [GET http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/status] starting
2022-07-06T21:56:37,637 DEBUG [HttpClient-Netty-Worker-21] org.apache.druid.java.util.http.client.NettyHttpClient - [GET http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/status] messageReceived: DefaultHttpResponse(chunked: true)
X-Druid-Task-Id: index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc
2022-07-06T21:56:37,637 DEBUG [HttpClient-Netty-Worker-21] org.apache.druid.java.util.http.client.NettyHttpClient - [GET http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/status] Got response: 200 OK
2022-07-06T21:56:37,637 DEBUG [HttpClient-Netty-Worker-21] org.apache.druid.java.util.http.client.NettyHttpClient - [GET http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/status] messageReceived: org.jboss.netty.handler.codec.http.DefaultHttpChunk@38855127
2022-07-06T21:56:37,637 DEBUG [HttpClient-Netty-Worker-21] org.apache.druid.java.util.http.client.NettyHttpClient - [GET http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/status] Got chunk: 9B, last=false
2022-07-06T21:56:37,637 DEBUG [HttpClient-Netty-Worker-21] org.apache.druid.java.util.http.client.NettyHttpClient - [GET http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/status] messageReceived: org.jboss.netty.handler.codec.http.HttpChunk$1@47788630
2022-07-06T21:56:37,637 DEBUG [HttpClient-Netty-Worker-21] org.apache.druid.java.util.http.client.NettyHttpClient - [GET http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/status] Got chunk: 0B, last=true
2022-07-06T21:56:37,637 INFO [IndexTaskClient-REDACT-5] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc] to change its status to [PAUSED]; will try again in [PT2S]
2022-07-06T21:56:39,637 DEBUG [IndexTaskClient-REDACT-5] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - GetStatus task[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]
  • This continues for some time; I count 287 occurrences of similar retry messages in this peon task:
2022-07-06T21:56:35,628 INFO [IndexTaskClient-REDACT-5] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc] to change its status to [PAUSED]; will try again in [PT2S]
...
2022-07-06T22:02:15,663 INFO [IndexTaskClient-REDACT-4] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc] to change its status to [PAUSED]; will try again in [PT2S]

2022-07-06T22:02:16,518 INFO [IndexTaskClient-REDACT-5] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc] to change its status to [PAUSED]; will try again in [PT2S]

2022-07-06T22:02:17,196 INFO [IndexTaskClient-REDACT-0] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc] to change its status to [PAUSED]; will try again in [PT2S]

2022-07-06T22:02:17,680 INFO [IndexTaskClient-REDACT-4] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc] to change its status to [PAUSED]; will try again in [PT2S]

2022-07-06T22:02:18,524 INFO [IndexTaskClient-REDACT-5] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc] to change its status to [PAUSED]; will try again in [PT2S]

2022-07-06T22:02:19,206 INFO [IndexTaskClient-REDACT-0] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc] to change its status to [PAUSED]; will try again in [PT2S]
...
...
2022-07-06T22:02:27,720 INFO [IndexTaskClient-REDACT-4] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc] to change its status to [PAUSED]; will try again in [PT2S]
  • Until eventually it is killed via shutdown, since it never paused: on the next pause attempt it responded with “400 Bad Request: can’t pause a task in the PUBLISHING phase”. According to the log snippet below, it jumped from the “STARTING” phase to “PUBLISHING” between “/status” checks/calls:
2022-07-06T22:05:12,414 DEBUG [IndexTaskClient-REDACT-0] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Pause task[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]
2022-07-06T22:05:12,417 DEBUG [IndexTaskClient-REDACT-0] org.apache.druid.indexing.common.IndexTaskClient - HTTP POST: http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/pause
2022-07-06T22:05:12,417 DEBUG [IndexTaskClient-REDACT-0] org.apache.druid.java.util.http.client.NettyHttpClient - [POST http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/pause] starting
2022-07-06T22:05:12,420 DEBUG [HttpClient-Netty-Worker-9] org.apache.druid.java.util.http.client.NettyHttpClient - [POST http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/pause] messageReceived: DefaultHttpResponse(chunked: true)
X-Druid-Task-Id: index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc
2022-07-06T22:05:12,421 DEBUG [HttpClient-Netty-Worker-9] org.apache.druid.java.util.http.client.NettyHttpClient - [POST http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/pause] Got response: 400 Bad Request
2022-07-06T22:05:12,421 DEBUG [HttpClient-Netty-Worker-9] org.apache.druid.java.util.http.client.NettyHttpClient - [POST http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/pause] messageReceived: org.jboss.netty.handler.codec.http.DefaultHttpChunk@317bbf2c
2022-07-06T22:05:12,421 DEBUG [HttpClient-Netty-Worker-9] org.apache.druid.java.util.http.client.NettyHttpClient - [POST http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/pause] Got chunk: 66B, last=false
2022-07-06T22:05:12,421 DEBUG [HttpClient-Netty-Worker-9] org.apache.druid.java.util.http.client.NettyHttpClient - [POST http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/pause] messageReceived: org.jboss.netty.handler.codec.http.HttpChunk$1@47788630
2022-07-06T22:05:12,421 DEBUG [HttpClient-Netty-Worker-9] org.apache.druid.java.util.http.client.NettyHttpClient - [POST http://REDACTED.host.com:8100/druid/worker/v1/chat/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/pause] Got chunk: 0B, last=true
2022-07-06T22:05:12,427 INFO [KafkaSupervisor-REDACT-Worker-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc] because: [An exception occured while waiting for task [index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc] to pause: [org.apache.druid.java.util.common.IAE: Received 400 Bad Request with body: Can't pause, task is not in a pausable state (state: [PUBLISHING])]]
2022-07-06T22:05:12,427 DEBUG [KafkaSupervisor-REDACT-Worker-1] org.apache.druid.java.util.http.client.NettyHttpClient - [POST http://REDACTED.host.com:8091/druid/worker/v1/task/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/shutdown] starting
2022-07-06T22:05:12,439 DEBUG [HttpClient-Netty-Worker-6] org.apache.druid.java.util.http.client.NettyHttpClient - [POST http://REDACTED.host.com:8091/druid/worker/v1/task/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/shutdown] messageReceived: DefaultHttpResponse(chunked: true)
2022-07-06T22:05:12,439 DEBUG [HttpClient-Netty-Worker-6] org.apache.druid.java.util.http.client.NettyHttpClient - [POST http://REDACTED.host.com:8091/druid/worker/v1/task/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/shutdown] Got response: 200 OK
2022-07-06T22:05:12,439 DEBUG [HttpClient-Netty-Worker-6] org.apache.druid.java.util.http.client.NettyHttpClient - [POST http://REDACTED.host.com:8091/druid/worker/v1/task/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/shutdown] messageReceived: org.jboss.netty.handler.codec.http.DefaultHttpChunk@3c164b04
2022-07-06T22:05:12,439 DEBUG [HttpClient-Netty-Worker-6] org.apache.druid.java.util.http.client.NettyHttpClient - [POST http://REDACTED.host.com:8091/druid/worker/v1/task/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/shutdown] Got chunk: 53B, last=false
2022-07-06T22:05:12,439 DEBUG [HttpClient-Netty-Worker-6] org.apache.druid.java.util.http.client.NettyHttpClient - [POST http://REDACTED.host.com:8091/druid/worker/v1/task/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/shutdown] messageReceived: org.jboss.netty.handler.codec.http.HttpChunk$1@47788630
2022-07-06T22:05:12,439 DEBUG [HttpClient-Netty-Worker-6] org.apache.druid.java.util.http.client.NettyHttpClient - [POST http://REDACTED.host.com:8091/druid/worker/v1/task/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc/shutdown] Got chunk: 0B, last=true
2022-07-06T22:05:12,439 INFO [KafkaSupervisor-REDACT-Worker-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: REDACTED.host.com:8091, status 200 OK, response: {"task":"index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc"}

#3 Continued

According to the source, a task can be in one of these states (the comments are my annotations):

public enum Status
  {
    NOT_STARTED,  // task created but not yet running
    STARTING,     // initializing, before consuming begins
    READING,      // actively consuming from the stream
    PAUSED,       // ingestion loop paused via the /pause endpoint
    PUBLISHING    // publishing segments; not pausable (hence the 400 above)
  }
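My reading of the intended lifecycle, pieced together from the pause javadoc and our logs (an assumption on my part, not something confirmed in the source):

NOT_STARTED -> STARTING -> READING <-> PAUSED
                              |
                              v
                          PUBLISHING   (not pausable; 400 on /pause)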

What we are seeing right now is that the task is stuck in the “STARTING” status but appears to still be doing work on the Peon side with segments and actions.

This is what I see on the Peon side during the 202 Accepted response:

  • During the 202 Accepted, the main thing I can see (though I’m not sure what to make of it yet) is the appenderator-merge thread running, looking like it’s actually working with the data. For example:
2022-07-06T21:56:34,597 INFO [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Segment[REDACT_2022-07-06T20:00:00.000Z_2022-07-06T21:00:00.000Z_2022-07-06T20:08:32.626Z_48] of 41,979,452 bytes built from 11 incremental persist(s) in 16,246ms; pushed to deep storage in 11,819ms. Load spec is: {"type":"s3_zip","bucket":"REDACT","key":"druid/segments/REDACT/2022-07-06T20:00:00.000Z_2022-07-06T21:00:00.000Z/2022-07-06T20:08:32.626Z/48/998c542a-1af0-4d6d-8b47-0396eea79f30/index.zip","S3Schema":"s3n"}
2022-07-06T21:56:34,601 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Segment[REDACT_2022-07-06T21:00:00.000Z_2022-07-06T22:00:00.000Z_2022-07-06T21:26:13.701Z_15] adding hydrant[FireHydrant{queryable=REDACT_2022-07-06T21:00:00.000Z_2022-07-06T22:00:00.000Z_2022-07-06T21:26:13.701Z_15, count=0}]
2022-07-06T21:56:34,601 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Segment[REDACT_2022-07-06T21:00:00.000Z_2022-07-06T22:00:00.000Z_2022-07-06T21:26:13.701Z_15] adding hydrant[FireHydrant{queryable=REDACT_2022-07-06T21:00:00.000Z_2022-07-06T22:00:00.000Z_2022-07-06T21:26:13.701Z_15, count=1}]
2022-07-06T21:56:34,601 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Segment[REDACT_2022-07-06T21:00:00.000Z_2022-07-06T22:00:00.000Z_2022-07-06T21:26:13.701Z_15] adding hydrant[FireHydrant{queryable=REDACT_2022-07-06T21:00:00.000Z_2022-07-06T22:00:00.000Z_2022-07-06T21:26:13.701Z_15, count=2}]
2022-07-06T21:56:34,601 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Segment[REDACT_2022-07-06T21:00:00.000Z_2022-07-06T22:00:00.000Z_2022-07-06T21:26:13.701Z_15] adding hydrant[FireHydrant{queryable=REDACT_2022-07-06T21:00:00.000Z_2022-07-06T22:00:00.000Z_2022-07-06T21:26:13.701Z_15, count=3}]
2022-07-06T21:56:34,601 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Segment[REDACT_2022-07-06T21:00:00.000Z_2022-07-06T22:00:00.000Z_2022-07-06T21:26:13.701Z_15] adding hydrant[FireHydrant{queryable=REDACT_2022-07-06T21:00:00.000Z_2022-07-06T22:00:00.000Z_2022-07-06T21:26:13.701Z_15, count=4}]
2022-07-06T21:56:34,601 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Segment[REDACT_2022-07-06T21:00:00.000Z_2022-07-06T22:00:00.000Z_2022-07-06T21:26:13.701Z_15] adding hydrant[FireHydrant{queryable=REDACT_2022-07-06T21:00:00.000Z_2022-07-06T22:00:00.000Z_2022-07-06T21:26:13.701Z_15, count=5}]
2022-07-06T21:56:34,601 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Segment[REDACT_2022-07-06T21:00:00.000Z_2022-07-06T22:00:00.000Z_2022-07-06T21:26:13.701Z_15] adding hydrant[FireHydrant{queryable=REDACT_2022-07-06T21:00:00.000Z_2022-07-06T22:00:00.000Z_2022-07-06T21:26:13.701Z_15, count=6}]
2022-07-06T21:56:34,601 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Segment[REDACT_2022-07-06T21:00:00.000Z_2022-07-06T22:00:00.000Z_2022-07-06T21:26:13.701Z_15] adding hydrant[FireHydrant{queryable=REDACT_2022-07-06T21:00:00.000Z_2022-07-06T22:00:00.000Z_2022-07-06T21:26:13.701Z_15, count=7}]
2022-07-06T21:56:34,601 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Segment[REDACT_2022-07-06T21:00:00.000Z_2022-07-06T22:00:00.000Z_2022-07-06T21:26:13.701Z_15] adding hydrant[FireHydrant{queryable=REDACT_2022-07-06T21:00:00.000Z_2022-07-06T22:00:00.000Z_2022-07-06T21:26:13.701Z_15, count=8}]
2022-07-06T21:56:34,601 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Segment[REDACT_2022-07-06T21:00:00.000Z_2022-07-06T22:00:00.000Z_2022-07-06T21:26:13.701Z_15] adding hydrant[FireHydrant{queryable=REDACT_2022-07-06T21:00:00.000Z_2022-07-06T22:00:00.000Z_2022-07-06T21:26:13.701Z_15, count=9}]
2022-07-06T21:56:34,601 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Segment[REDACT_2022-07-06T21:00:00.000Z_2022-07-06T22:00:00.000Z_2022-07-06T21:26:13.701Z_15] adding hydrant[FireHydrant{queryable=REDACT_2022-07-06T21:00:00.000Z_2022-07-06T22:00:00.000Z_2022-07-06T21:26:13.701Z_15, count=10}]
2022-07-06T21:56:34,601 INFO [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.IndexMerger - Indexes have incompatible dimension orders, try falling back on dimension ordering from ingestionSpec
2022-07-06T21:56:34,601 INFO [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.IndexMerger - Cannot fall back on dimension ordering from ingestionSpec as it does not exist
2022-07-06T21:56:34,601 WARN [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.IndexMerger - Indexes have incompatible dimension orders and there is no valid dimension ordering in the ingestionSpec, using lexicographic order.
2022-07-06T21:56:34,601 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.IndexMergerV9 - Using SegmentWriteOutMediumFactory[TmpFileSegmentWriteOutMediumFactory]
2022-07-06T21:56:34,601 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.IndexMergerV9 - Completed version.bin in 0 millis.
2022-07-06T21:56:34,601 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.IndexMergerV9 - Completed factory.json in 0 millis
2022-07-06T21:56:34,602 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.IndexMerger - Allocated [133,140] bytes of dictionary merging direct buffers
2022-07-06T21:56:34,627 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.StringDimensionMergerV9 - Completed dim[Name] conversions with cardinality[31,146] in 25 millis.
2022-07-06T21:56:34,628 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.IndexMerger - Allocated [133,900] bytes of dictionary merging direct buffers
2022-07-06T21:56:34,653 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.StringDimensionMergerV9 - Completed dim[ResourceId] conversions with cardinality[31,834] in 26 millis.
2022-07-06T21:56:34,653 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.IndexMerger - Allocated [156] bytes of dictionary merging direct buffers
2022-07-06T21:56:34,653 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.StringDimensionMergerV9 - Completed dim[ResourceKindKey] conversions with cardinality[4] in 0 millis.
2022-07-06T21:56:34,653 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.IndexMerger - Allocated [1,644] bytes of dictionary merging direct buffers
2022-07-06T21:56:34,654 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.StringDimensionMergerV9 - Completed dim[cpu] conversions with cardinality[49] in 1 millis.
2022-07-06T21:56:34,654 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.IndexMerger - Allocated [20,024] bytes of dictionary merging direct buffers
2022-07-06T21:56:34,656 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.StringDimensionMergerV9 - Completed dim[datastore] conversions with cardinality[1,581] in 2 millis.
2022-07-06T21:56:34,656 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.IndexMerger - Allocated [32] bytes of dictionary merging direct buffers
2022-07-06T21:56:34,656 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.StringDimensionMergerV9 - Completed dim[devices] conversions with cardinality[2] in 0 millis.
2022-07-06T21:56:34,657 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.IndexMerger - Allocated [88] bytes of dictionary merging direct buffers
2022-07-06T21:56:34,657 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.StringDimensionMergerV9 - Completed dim[disk] conversions with cardinality[2] in 1 millis.
2022-07-06T21:56:34,657 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.IndexMerger - Allocated [19,900] bytes of dictionary merging direct buffers
2022-07-06T21:56:34,660 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.StringDimensionMergerV9 - Completed dim[diskspace] conversions with cardinality[3,010] in 3 millis.
2022-07-06T21:56:34,660 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.IndexMerger - Allocated [40] bytes of dictionary merging direct buffers
2022-07-06T21:56:34,660 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.StringDimensionMergerV9 - Completed dim[gpu] conversions with cardinality[4] in 0 millis.
2022-07-06T21:56:34,660 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.IndexMerger - Allocated [18,496] bytes of dictionary merging direct buffers
2022-07-06T21:56:34,663 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.StringDimensionMergerV9 - Completed dim[guestfilesystem] conversions with cardinality[2,297] in 3 millis.
2022-07-06T21:56:34,663 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.IndexMerger - Allocated [108,276] bytes of dictionary merging direct buffers
2022-07-06T21:56:34,683 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.StringDimensionMergerV9 - Completed dim[hostname] conversions with cardinality[25,109] in 20 millis.
2022-07-06T21:56:34,684 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.IndexMerger - Allocated [16,776] bytes of dictionary merging direct buffers
2022-07-06T21:56:34,685 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.StringDimensionMergerV9 - Completed dim[metric] conversions with cardinality[463] in 1 millis.
2022-07-06T21:56:34,686 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.IndexMerger - Allocated [816] bytes of dictionary merging direct buffers
2022-07-06T21:56:34,686 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.StringDimensionMergerV9 - Completed dim[net] conversions with cardinality[27] in 1 millis.
2022-07-06T21:56:34,686 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.IndexMerger - Allocated [136] bytes of dictionary merging direct buffers
2022-07-06T21:56:34,686 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.StringDimensionMergerV9 - Completed dim[network] conversions with cardinality[5] in 0 millis.
2022-07-06T21:56:34,686 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.StringDimensionMergerV9 - Completed dim[ping] conversions with cardinality[7] in 0 millis.
2022-07-06T21:56:34,686 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.IndexMerger - Allocated [640] bytes of dictionary merging direct buffers
2022-07-06T21:56:34,686 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.StringDimensionMergerV9 - Completed dim[snapshot] conversions with cardinality[55] in 0 millis.
2022-07-06T21:56:34,687 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.IndexMerger - Allocated [88] bytes of dictionary merging direct buffers
2022-07-06T21:56:34,687 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.StringDimensionMergerV9 - Completed dim[storageAdapter] conversions with cardinality[2] in 1 millis.
2022-07-06T21:56:34,687 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.IndexMerger - Allocated [1,192] bytes of dictionary merging direct buffers
2022-07-06T21:56:34,687 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.StringDimensionMergerV9 - Completed dim[template] conversions with cardinality[288] in 0 millis.
2022-07-06T21:56:34,687 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.IndexMerger - Allocated [2,944] bytes of dictionary merging direct buffers
2022-07-06T21:56:34,688 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.StringDimensionMergerV9 - Completed dim[virtualDisk] conversions with cardinality[89] in 1 millis.
2022-07-06T21:56:34,688 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.IndexMerger - Allocated [152] bytes of dictionary merging direct buffers
2022-07-06T21:56:34,688 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.StringDimensionMergerV9 - Completed dim[REDACT_host] conversions with cardinality[7] in 0 millis.
2022-07-06T21:56:34,688 DEBUG [[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]-appenderator-merge] org.apache.druid.segment.IndexMergerV9 - Completed dim conversions in 86 millis.

#4 Final

To answer my own previous question: when the PUBLISHING phase is completed and the overall task is done, the peon is supposed to update the ZK node, which is then picked up by the Supervisor/Overlord.

But since the task never got to complete fully (it never actually “paused” and a shutdown was sent), I don’t believe it ever got the chance to update the ZK node before exiting shortly after. It did, however, appear to publish/push segments and do handoffs: it registered a “SUCCESS” status after shutdown, and I see logs for segment handoffs/publishing right before shutdown.
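For anyone else debugging this, that status znode can be inspected while the task is running (zkCli.sh ships with ZooKeeper; the path below assumes the default druid.zk.paths.base of /druid, with the worker keyed by the MiddleManager's host:port, so adjust for your cluster):

  # Dump the task status announcement that the Overlord's RemoteTaskRunner watches
  ./zkCli.sh -server ZKNODE:2181 get /druid/indexer/status/REDACTED.host.com:8091/index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc

Some peon logs from right before shutdown: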

2022-07-06T22:04:30,711 DEBUG [coordinator_handoff_scheduled_0] org.apache.druid.segment.handoff.CoordinatorBasedSegmentHandoffNotifier - Segment Handoff complete for dataSource[REDACT] Segment[SegmentDescriptor{interval=2022-07-06T21:00:00.000Z/2022-07-06T22:00:00.000Z, version='2022-07-06T21:26:13.701Z', partitionNumber=15}]
2022-07-06T22:04:30,711 DEBUG [coordinator_handoff_scheduled_0] org.apache.druid.segment.realtime.appenderator.StreamAppenderatorDriver - Segment[vrops_2022-07-06T21:00:00.000Z_2022-07-06T22:00:00.000Z_2022-07-06T21:26:13.701Z_15] successfully handed off, dropping.
2022-07-06T22:04:30,711 DEBUG [main-SendThread(ZNODE:2181)] org.apache.zookeeper.ClientCnxn - Got notification sessionid:0x200036d0cfc0076
...
...
2022-07-06T22:05:12,433 INFO [parent-monitor-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Triggering JVM shutdown.
2022-07-06T22:05:12,434 INFO [Thread-59] org.apache.druid.cli.CliPeon - Running shutdown hook
2022-07-06T22:05:12,434 INFO [Thread-59] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [ANNOUNCEMENTS]
2022-07-06T22:05:12,436 DEBUG [Thread-59] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.curator.announcement.Announcer.stop()] on object[org.apache.druid.curator.announcement.Announcer@3d05435c].
2022-07-06T22:05:12,436 DEBUG [Thread-59] org.apache.druid.curator.announcement.Announcer - Stopping Announcer.
...
2022-07-06T22:05:12,437 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/internal-discovery/PEON/REDACTED.host.com:8101
2022-07-06T22:05:12,440 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/internal-discovery/PEON
2022-07-06T22:05:12,440 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/internal-discovery/PEON/REDACTED.host.com:8101
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/internal-discovery/PEON/REDACTED.host.com:8101
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/internal-discovery/PEON/REDACTED.host.com:8101
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/internal-discovery/PEON/REDACTED.host.com:8100
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/internal-discovery/PEON/REDACTED.host.com:8100
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/internal-discovery/PEON/REDACTED.host.com:8101
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/internal-discovery/PEON/REDACTED.host.com:8100
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/internal-discovery/PEON/REDACTED.host.com:8100
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/internal-discovery/PEON/REDACTED.host.com:8101
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/internal-discovery/PEON/REDACTED.host.com:8100
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/internal-discovery/PEON/REDACTED.host.com:8101
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/segments/REDACTED.host.com:8100_indexer-executor__default_tier_2022-07-06T20:57:27.834Z_508f2c845e3c42e4bb186a8e241594db0
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/segments/REDACTED.host.com:8100
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8100
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8101
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8083
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8083
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8083
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8083
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8083
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8083
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8101
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8083
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/198.18.22.69:8082
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8101
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8100
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8083
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8083
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8083
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8101
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/198.18.22.77:8082
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8100
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8083
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8083
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8083
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8100
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8083
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8083
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8083
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8083
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8083
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8083
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8101
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8101
2022-07-06T22:05:12,441 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8101
...
2022-07-06T22:05:12,443 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8083
2022-07-06T22:05:12,443 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/announcements/REDACTED.host.com:8100
2022-07-06T22:05:12,443 DEBUG [main-SendThread(ZKNode:2181)] org.apache.zookeeper.ClientCnxn - Reading reply sessionid:0x200036d0cfc0076, packet:: clientPath:/druid/announcements/REDACTED.host.com:8083,3  response:: null
2022-07-06T22:05:12,443 INFO [Thread-59] org.apache.druid.curator.announcement.Announcer - Unannouncing [/druid/announcements/REDACTED.host.com:8100]
...
2022-07-06T22:05:12,458 INFO [Thread-59] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [SERVER]
2022-07-06T22:05:12,458 DEBUG [Thread-59] org.apache.druid.server.initialization.jetty.JettyServerModule - Skipping unannounce wait.
2022-07-06T22:05:12,458 DEBUG [Thread-59] org.apache.druid.server.initialization.jetty.JettyServerModule - Stopping Jetty Server...
2022-07-06T22:05:12,458 DEBUG [Thread-59] org.eclipse.jetty.util.component.AbstractLifeCycle - stopping Server@50de907a{STARTED}[9.4.40.v20210413]
2022-07-06T22:05:12,458 DEBUG [Thread-59] org.apache.druid.server.initialization.jetty.JettyServerModule - Jetty lifecycle stopping [class org.eclipse.jetty.server.Server]
2022-07-06T22:05:12,458 DEBUG [Thread-59] org.eclipse.jetty.server.Server - doStop Server@50de907a{STOPPING}[9.4.40.v20210413]
...
2022-07-06T22:05:12,482 DEBUG [Thread-59] org.apache.druid.server.initialization.jetty.JettyServerModule - Jetty lifecycle stopped [class org.eclipse.jetty.server.Server]
2022-07-06T22:05:12,482 INFO [Thread-59] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [NORMAL]
2022-07-06T22:05:12,483 DEBUG [Thread-59] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.server.coordination.ZkCoordinator.stop()] on object[org.apache.druid.server.coordination.ZkCoordinator@70777a65].
2022-07-06T22:05:12,483 INFO [Thread-59] org.apache.druid.server.coordination.ZkCoordinator - Stopping ZkCoordinator for [DruidServerMetadata{name='REDACTED.host.com:8100', hostAndPort='REDACTED.host.com:8100', hostAndTlsPort='null', maxSize=0, tier='_default_tier', type=indexer-executor, priority=0}]
2022-07-06T22:05:12,483 DEBUG [Thread-59] org.apache.curator.framework.imps.WatcherRemovalManager - Removing watcher for path: /druid/loadQueue/REDACTED.host.com:8100
2022-07-06T22:05:12,483 DEBUG [Thread-59] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.server.coordination.SegmentLoadDropHandler.stop()] on object[org.apache.druid.server.coordination.SegmentLoadDropHandler@4ffe3d42].
2022-07-06T22:05:12,483 INFO [Thread-59] org.apache.druid.server.coordination.SegmentLoadDropHandler - Stopping...
2022-07-06T22:05:12,483 DEBUG [Thread-59] org.apache.druid.server.coordination.CuratorDataSegmentServerAnnouncer - Unannouncing self[DruidServerMetadata{name='REDACTED.host.com:8100', hostAndPort='REDACTED.host.com:8100', hostAndTlsPort='null', maxSize=0, tier='_default_tier', type=indexer-executor, priority=0}] at [/druid/announcements/REDACTED.host.com:8100]
2022-07-06T22:05:12,483 DEBUG [Thread-59] org.apache.druid.curator.announcement.Announcer - Path[/druid/announcements/REDACTED.host.com:8100] not announced, cannot unannounce.
2022-07-06T22:05:12,483 INFO [Thread-59] org.apache.druid.server.coordination.SegmentLoadDropHandler - Stopped.
2022-07-06T22:05:12,483 DEBUG [Thread-59] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.indexing.worker.executor.ExecutorLifecycle.stop() throws java.lang.Exception] on object[org.apache.druid.indexing.worker.executor.ExecutorLifecycle@59f76e56].
2022-07-06T22:05:12,483 DEBUG [Thread-59] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner.stop()] on object[org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner@2dd8a273].
2022-07-06T22:05:12,483 INFO [Thread-59] org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Starting graceful shutdown of task[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc].
2022-07-06T22:05:12,483 INFO [Thread-59] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - Stopping forcefully (status: [PUBLISHING])
2022-07-06T22:05:12,483 DEBUG [Thread-59] org.apache.druid.indexing.overlord.TaskRunnerUtils - Task [index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc] status changed to [FAILED].
2022-07-06T22:05:12,483 DEBUG [task-runner-0-priority-0] org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - [Consumer clientId=consumer-kafka-supervisor-jlnikjkf-1, groupId=kafka-supervisor-jlnikjkf] Executing onLeavePrepare with generation Generation{generationId=-1, memberId='', protocol='null'} and memberId 
2022-07-06T22:05:12,483 DEBUG [task-runner-0-priority-0] org.apache.kafka.clients.consumer.internals.AbstractCoordinator - [Consumer clientId=consumer-kafka-supervisor-jlnikjkf-1, groupId=kafka-supervisor-jlnikjkf] Resetting generation due to consumer pro-actively leaving the group
2022-07-06T22:05:12,483 DEBUG [Thread-59] org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void org.apache.druid.java.util.http.client.NettyHttpClient.stop()] on object[org.apache.druid.java.util.http.client.NettyHttpClient@5bd3ca3c].
2022-07-06T22:05:12,484 DEBUG [kafka-producer-network-thread | producer-1] org.apache.kafka.clients.NetworkClient - [Producer clientId=producer-1] Sending PRODUCE request with header RequestHeader(apiKey=PRODUCE, apiVersion=9, clientId=producer-1, correlationId=1753) and timeout 30000 to node 3: {acks=1,timeout=30000,partitionSizes=[druid_metrics-3=683]}
2022-07-06T22:05:12,484 INFO [task-runner-0-priority-0] org.apache.kafka.common.metrics.Metrics - Metrics scheduler closed
2022-07-06T22:05:12,484 INFO [task-runner-0-priority-0] org.apache.kafka.common.metrics.Metrics - Closing reporter org.apache.kafka.common.metrics.JmxReporter
2022-07-06T22:05:12,484 INFO [task-runner-0-priority-0] org.apache.kafka.common.metrics.Metrics - Metrics reporters closed
2022-07-06T22:05:12,484 DEBUG [main-SendThread(ZKNODE:2181)] org.apache.zookeeper.ClientCnxn - Reading reply sessionid:0x200036d0cfc0076, packet:: clientPath:/druid/loadQueue/REDACTED.host.com:8100 serverPath:/druid/loadQueue/REDACTED.host.com:8100 finished:false header:: 288,17  replyHeader:: 288,21476706700,0  request:: '/druid/loadQueue/REDACTED.host.com:8100,3  response:: null
2022-07-06T22:05:12,486 INFO [task-runner-0-priority-0] org.apache.kafka.common.utils.AppInfoParser - App info kafka.consumer for consumer-kafka-supervisor-jlnikjkf-1 unregistered
2022-07-06T22:05:12,486 DEBUG [task-runner-0-priority-0] org.apache.kafka.clients.consumer.KafkaConsumer - [Consumer clientId=consumer-kafka-supervisor-jlnikjkf-1, groupId=kafka-supervisor-jlnikjkf] Kafka consumer has been closed
2022-07-06T22:05:12,486 DEBUG [task-runner-0-priority-0] org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Shutting down immediately...
2022-07-06T22:05:12,487 INFO [task-runner-0-priority-0] org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing segment[REDACT_2022-07-06T20:00:00.000Z_2022-07-06T21:00:00.000Z_2022-07-06T20:08:32.626Z_54] at path[/druid/segments/REDACTED.host.com:8100/REDACTED.host.com:8100_indexer-executor__default_tier_2022-07-06T20:57:27.834Z_508f2c845e3c42e4bb186a8e241594db0]
2022-07-06T22:05:12,487 INFO [task-runner-0-priority-0] org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing segment[REDACT_2022-07-06T21:00:00.000Z_2022-07-06T22:00:00.000Z_2022-07-06T21:26:13.701Z_21] at path[/druid/segments/REDACTED.host.com:8100/REDACTED.host.com:8100_indexer-executor__default_tier_2022-07-06T20:57:27.834Z_508f2c845e3c42e4bb186a8e241594db0]
2022-07-06T22:05:12,487 DEBUG [task-runner-0-priority-0] org.apache.druid.curator.announcement.Announcer - Path[/druid/segments/REDACTED.host.com:8100/REDACTED.host.com:8100_indexer-executor__default_tier_2022-07-06T20:57:27.834Z_508f2c845e3c42e4bb186a8e241594db0] not announced, cannot unannounce.
2022-07-06T22:05:12,487 DEBUG [task-runner-0-priority-0] org.apache.druid.segment.realtime.firehose.ServiceAnnouncingChatHandlerProvider - Unregistering chat handler[index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc]
2022-07-06T22:05:12,487 DEBUG [task-runner-0-priority-0] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Unannouncing self [{"druidNode":{"service":"druid/middleManager","host":"REDACTED.host.com","bindOnHost":false,"plaintextPort":8100,"port":-1,"tlsPort":-1,"enablePlaintextPort":true,"enableTlsPort":false},"nodeType":"peon","services":{"dataNodeService":{"type":"dataNodeService","tier":"_default_tier","maxSize":0,"type":"indexer-executor","priority":0},"lookupNodeService":{"type":"lookupNodeService","lookupTier":"__default"}}}].
2022-07-06T22:05:12,488 DEBUG [task-runner-0-priority-0] org.apache.druid.curator.announcement.Announcer - Path[/druid/internal-discovery/PEON/REDACTED.host.com:8100] not announced, cannot unannounce.
2022-07-06T22:05:12,488 INFO [task-runner-0-priority-0] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Unannounced self [{"druidNode":{"service":"druid/middleManager","host":"REDACTED.host.com","bindOnHost":false,"plaintextPort":8100,"port":-1,"tlsPort":-1,"enablePlaintextPort":true,"enableTlsPort":false},"nodeType":"peon","services":{"dataNodeService":{"type":"dataNodeService","tier":"_default_tier","maxSize":0,"type":"indexer-executor","priority":0},"lookupNodeService":{"type":"lookupNodeService","lookupTier":"__default"}}}].
2022-07-06T22:05:12,489 DEBUG [kafka-producer-network-thread | producer-1] org.apache.kafka.clients.NetworkClient - [Producer clientId=producer-1] Received PRODUCE response from node 3 for request with header RequestHeader(apiKey=PRODUCE, apiVersion=9, clientId=producer-1, correlationId=1753): ProduceResponseData(responses=[TopicProduceResponse(name='druid_metrics', partitionResponses=[PartitionProduceResponse(index=3, errorCode=0, baseOffset=25667475537, logAppendTimeMs=-1, logStartOffset=25369377691, recordErrors=[], errorMessage=null)])], throttleTimeMs=0)
2022-07-06T22:05:12,492 DEBUG [task-runner-0-priority-0] org.apache.druid.indexing.overlord.TaskRunnerUtils - Task [index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc] status changed to [SUCCESS].
2022-07-06T22:05:12,493 INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_kafka_REDACT_a5c10ee5effa63e_bhjndmoc",
  "status" : "SUCCESS",
  "duration" : 4121927,
  "errorMsg" : null,
  "location" : {
    "host" : null,
    "port" : -1,
    "tlsPort" : -1
  }
}

I am unsure exactly how to proceed, but I am hoping to get more insight from the community and also shed some light on an ongoing issue that may be affecting many Druid users across different versions.

I wonder why the “202 Accepted” response exists, and what the expectation was in implementing it, if it can lead to scenarios like this where the task never actually pauses and stays in the STARTING phase. What conditions lead to it? I am also open to tuning retries/timeouts as a temporary workaround (like the PT2S backoff and maxRetries for “Waiting for task to pause”, or extending the HTTP timeout so the 202s go away and the call eventually responds with 200), but I am unsure how to change those and whether they would even help.
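For reference, the knobs I have found so far live in the supervisor spec’s tuningConfig; per the docs, httpTimeout and chatRetries drive the IndexTaskClient calls (the PT2S/PT4S/PT8S backoff itself looks hardcoded, as far as I can tell). A fragment with illustrative values, not recommendations:

  "tuningConfig": {
    "type": "kafka",
    "httpTimeout": "PT30S",
    "chatRetries": 12,
    "shutdownTimeout": "PT120S"
  }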

I am also unsure whether upgrading from 0.22.1 to 0.23.0 would help; I don’t see many peon/overlord changes in the release notes, but I am open to discussion/advice.

Okay, I think I’m done editing/fixing my latest findings. Apologies, it was a lot and may have been unreadable while I was editing.

After more digging, we found that “202 Accepted” responses happen often on “pause” HTTP calls, but the task usually does pause/resume later. This specific issue happens when it never actually pauses. If we look at the code here: druid/SeekableStreamIndexTaskRunner.java at 0.22.1 · apache/druid · GitHub

We can see that it is supposed to log a debug message when it eventually does pause and resume after a 202:

  private boolean possiblyPause() throws InterruptedException
  {
    pauseLock.lockInterruptibly();
    try {
      if (pauseRequested) {
        status = Status.PAUSED;
        hasPaused.signalAll();

        log.debug("Received pause command, pausing ingestion until resumed.");
        while (pauseRequested) {
          shouldResume.await();
        }

        status = Status.READING;
        shouldResume.signalAll();
        log.debug("Received resume command, resuming ingestion.");
        return true;
      }
    }
    finally {
      pauseLock.unlock();
    }

    return false;
  }
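(Side note: both of those messages are logged at DEBUG, so the peon’s log4j2.xml needs the logger enabled to see them; a fragment like the following inside the <Loggers> section, assuming an appender named "Console", should do it:)

  <Logger name="org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner"
          level="debug" additivity="false">
    <AppenderRef ref="Console"/>
  </Logger>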

For normal pause with “202 Accepted” response:

  • Overlord sends HTTP request to Peon’s /pause endpoint
  • Peon thread handling HTTP request sets pauseRequested to true (line 1793), and blocks waiting for its main thread to call signalAll() on the hasPaused Condition (line 1814)
  • Peon’s main thread calls possiblyPause() repeatedly (line 579), observes that pauseRequested is true (line 1308), calls signalAll() on the hasPaused Condition (line 1310), and blocks waiting for another thread to call signalAll() on the shouldResume Condition (line 1314)
  • Peon thread handling HTTP request unblocks
  • Another thread calls signalAll() on the shouldResume Condition (line 1318/1433/1804/1845)
  • The main thread unblocks

For cases where the Peon responds with “202 Accepted” but never pauses, we never see that debug message. It looks something like this (see the sketch after this list):

  • Peon thread handling HTTP request is interrupted after a 2-second timeout (lines 1807, 1814) and returns 202 Accepted
  • Main thread SHOULD still observe that pauseRequested is true on its next call to possiblyPause()
    • Either something set pauseRequested to false before then, or something weird happened with pauseLock.lockInterruptibly() (line 1306) or hasPaused.signalAll() (line 1310)
    • We don’t see the following log (line 1312) after a 202 Accepted response: log.debug(“Received pause command, pausing ingestion until resumed.”);
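To make the two-condition handshake easier to reason about, here is a minimal, self-contained sketch of the pattern as I understand it. This is my own simplification, not the actual Druid code; the names mirror the fields in SeekableStreamIndexTaskRunner:

  import java.util.concurrent.TimeUnit;
  import java.util.concurrent.locks.Condition;
  import java.util.concurrent.locks.ReentrantLock;

  public class PauseHandshakeSketch
  {
    private final ReentrantLock pauseLock = new ReentrantLock();
    private final Condition hasPaused = pauseLock.newCondition();
    private final Condition shouldResume = pauseLock.newCondition();
    private volatile boolean pauseRequested = false;
    private volatile String status = "READING";

    // Rough analogue of the /pause HTTP handler: request the pause, then wait
    // up to 2 seconds for the main loop to acknowledge it. On timeout the real
    // endpoint gives up waiting and returns 202 Accepted instead of 200.
    public int pauseEndpoint() throws InterruptedException
    {
      pauseLock.lockInterruptibly();
      try {
        pauseRequested = true;
        long nanos = TimeUnit.SECONDS.toNanos(2);
        while (!"PAUSED".equals(status)) {
          if (nanos <= 0) {
            return 202; // main loop never signalled hasPaused in time
          }
          nanos = hasPaused.awaitNanos(nanos);
        }
        return 200;
      }
      finally {
        pauseLock.unlock();
      }
    }

    // Rough analogue of possiblyPause(), called from the main ingestion loop:
    // acknowledge the pause, then block until someone signals shouldResume.
    public void possiblyPause() throws InterruptedException
    {
      pauseLock.lockInterruptibly();
      try {
        if (pauseRequested) {
          status = "PAUSED";
          hasPaused.signalAll();   // wakes a pauseEndpoint() still waiting
          while (pauseRequested) {
            shouldResume.await();  // blocks until resume() below
          }
          status = "READING";
          shouldResume.signalAll();
        }
      }
      finally {
        pauseLock.unlock();
      }
    }

    // Rough analogue of the /resume path.
    public void resume() throws InterruptedException
    {
      pauseLock.lockInterruptibly();
      try {
        pauseRequested = false;
        shouldResume.signalAll();
      }
      finally {
        pauseLock.unlock();
      }
    }
  }

The important property is that pauseRequested stays true after a 202, so the very next possiblyPause() call should still pause. If that never happens, either something cleared the flag in between, or the main loop never reached possiblyPause() again (e.g. because it had already moved on to PUBLISHING, which would match the 400 “not in a pausable state” error we saw earlier).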

I believe this is enough information to get the ball rolling, but I can definitely provide more information if needed to better understand the issue. I would appreciate help from anyone.

If others are experiencing the same issue, I would also appreciate a simple comment/confirmation so we can see whether this is still affecting a wide audience, as indicated by the other issues I linked.

Interesting reading, @phc3!!!

Given you’ve done so much work on this and got down to the code level, I would suggest that the next step is to raise this as an Issue in GitHub so that Druid developers can see it.

I’ll also ping @2bethere in case he has any insights as well…

@petermarshallio, appreciate the suggestion! I have opened a GitHub issue as you recommended, here: Kafka Ingestion Peon Tasks Success But Overlord Shows Failure · Issue #12701 · apache/druid · GitHub

I will also probably try to keep this thread updated with findings just to close out the open issue here as well.

Unfortunately, the problem persists in 0.23.0. I updated the GitHub issue with more data.