- Druid Version: 0.22.1
- Kafka Ingestion (idempotent producer)
We recently started seeing intermittent Kafka ingestion task failures. They look inconsistent: the Peon task log shows “SUCCESS”, but the Overlord marks the task as “FAILED”.
From what I can see, the Peon task reports SUCCESS and then exits. The Overlord then tries to reach the Peon over HTTP, but because the Peon has already shut down, those calls fail and the Overlord eventually marks the task as “FAILED”.
Peon Log:
2022-05-25T12:11:00,996 INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status
...
...
...
2022-05-25T12:11:01,199 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [INIT]
Finished peon task
Overlord Logs:
2022-05-25T12:11:13,393 WARN [IndexTaskClient-vrops-1] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://ligplbyjchbtcgg.amer.replaceddomain.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/pause]; will try again in [PT2S] (body/exception: [Connection refused (Connection refused)])
2022-05-25T12:11:15,396 WARN [IndexTaskClient-vrops-1] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://ligplbyjchbtcgg.amer.replaceddomain.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/pause]; will try again in [PT4S] (body/exception: [Connection refused (Connection refused)])
2022-05-25T12:11:16,042 INFO [IndexTaskClient-vrops-0] org.apache.druid.indexing.common.IndexTaskClient - submitRequest failed for [http://ligplbyjchbtcgg.amer.replaceddomain.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/offsets/current], with message [Connection refused (Connection refused)]
2022-05-25T12:11:19,398 WARN [IndexTaskClient-vrops-1] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://ligplbyjchbtcgg.amer.replaceddomain.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/pause]; will try again in [PT8S] (body/exception: [Connection refused (Connection refused)])
2022-05-25T12:11:27,404 WARN [IndexTaskClient-vrops-1] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://ligplbyjchbtcgg.amer.replaceddomain.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/pause]; will try again in [PT10S] (body/exception: [Connection refused (Connection refused)])
2022-05-25T12:11:37,407 WARN [IndexTaskClient-vrops-1] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://ligplbyjchbtcgg.amer.replaceddomain.com:8101/druid/worker/v1/chat/index_kafka_vrops_a9f227794bcd31c_nohagige/pause]; will try again in [PT10S] (body/exception: [Connection refused (Connection refused)])
2022-05-25T12:11:40,377 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_vrops_a9f227794bcd31c_nohagige] because: [An exception occured while waiting for task [index_kafka_vrops_a9f227794bcd31c_nohagige] to pause: [org.apache.druid.java.util.common.IAE: Received 400 Bad Request with body: Can't pause, task is not in a pausable state (state: [PUBLISHING])]]
2022-05-25T12:11:40,382 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ligplbyjchbtcgg.amer.replaceddomain.com:8091, status 200 OK, response: {"task":"index_kafka_vrops_a9f227794bcd31c_nohagige"}
2022-05-25T12:11:40,382 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_vrops_a9f227794bcd31c_nohagige] from activeTasks
2022-05-25T12:11:40,382 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_vrops_a9f227794bcd31c_nohagige] from TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_vrops', dataSource='vrops', interval=2022-05-25T07:00:00.000Z/2022-05-25T08:00:00.000Z, version='2022-05-25T08:46:31.603Z', priority=75, revoked=false}]
2022-05-25T12:11:40,390 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_vrops_a9f227794bcd31c_nohagige] from TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_vrops', dataSource='vrops', interval=2022-05-25T08:00:00.000Z/2022-05-25T09:00:00.000Z, version='2022-05-25T09:40:09.046Z', priority=75, revoked=false}]
2022-05-25T12:11:40,401 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_vrops_a9f227794bcd31c_nohagige] from TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_vrops', dataSource='vrops', interval=2022-05-25T09:00:00.000Z/2022-05-25T10:00:00.000Z, version='2022-05-25T10:21:47.604Z', priority=75, revoked=false}]
2022-05-25T12:11:40,411 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_vrops_a9f227794bcd31c_nohagige] from TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_vrops', dataSource='vrops', interval=2022-05-25T10:00:00.000Z/2022-05-25T11:00:00.000Z, version='2022-05-25T11:16:12.450Z', priority=75, revoked=false}]
2022-05-25T12:11:40,416 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.MetadataTaskStorage - Updating task index_kafka_vrops_a9f227794bcd31c_nohagige to status: TaskStatus{id=index_kafka_vrops_a9f227794bcd31c_nohagige, status=FAILED, duration=-1, errorMsg=An exception occured while waiting for task [index_kafka_vrops_a9f227794bcd31c_nohagige] to pause: [...}
2022-05-25T12:11:40,420 INFO [KafkaSupervisor-vrops-Worker-0] org.apache.druid.indexing.overlord.TaskQueue - Task done: AbstractTask{id='index_kafka_vrops_a9f227794bcd31c_nohagige', groupId='index_kafka_vrops', taskResource=TaskResource{availabilityGroup='index_kafka_vrops_a9f227794bcd31c', requiredCapacity=1}, dataSource='vrops', context={checkpoints={"0":{"4":150881239982}}, useLineageBasedSegmentAllocation=true, IS_INCREMENTAL_HANDOFF_SUPPORTED=true, forceTimeChunkLock=true}}
2022-05-25T12:11:44,290 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_vrops_a9f227794bcd31c_nohagige] because: [task is not in knownTaskIds[[index_kafka_vrops_properties_ec5efd099c7bc54_dkniaoek, index_kafka_zabbix_04e7deda85b7b09_mejdnkkm, index_kafka_vrops_90fddd7dd00aec7_fjkjclgi, index_kafka_vrops_properties_0da27f4d850fe43_nhfdlkil, index_kafka_prom_lighthouse_080ca9811b1d2d4_aoiknogc, index_kafka_druid_metrics_8be3b0155a73ea0_fghicpfb, index_kafka_vrops_25dc2aa6be3a300_gohlcfap, index_kafka_vrops_6d765fcf8a23c1e_docphgjp, index_kafka_druid_metrics_a0629263d9acc3d_acdkbfmc, index_kafka_prom_lighthouse_9fd06cdb17ad979_ghaijclo, index_kafka_srm_a1139b196797600_fckbfele, index_kafka_prom_lighthouse_97f51e2a06fadd0_eddjgnfl, index_kafka_srm_bbfd58d981cd115_pnhjpgjp, index_kafka_zabbix_b980b711cff794a_flbfkppl, index_kafka_zabbix_63dd19779ab98c3_plfmield, index_kafka_ecscap_542f2ab1f65d0f9_fmfpedff, index_kafka_srm_32a67a2905a0ef9_doedneah, index_kafka_vrops_properties_0da27f4d850fe43_fmdheibd, index_kafka_vrops_properties_4437f8b14c8540c_apggadoc, index_kafka_srm_5eb82c47822b3d1_gfdgdohl, index_kafka_prom_lighthouse_a44e1099977e001_njccpjej, index_kafka_zabbix_ecc8882ac8a5e0b_anahjkgg, index_kafka_vrops_properties_4714686ef85e5f6_ajanocfp, index_kafka_ecscap_c79f891d46e541a_ppblajhf, index_kafka_vrops_properties_4714686ef85e5f6_ebbjffno]]]
2022-05-25T12:11:44,298 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ligplbyjchbtcgg.amer.replaceddomain.com:8091, status 200 OK, response: {"task":"index_kafka_vrops_a9f227794bcd31c_nohagige"}
2022-05-25T12:11:44,514 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_vrops_a9f227794bcd31c_nohagige] because: [task is not in knownTaskIds[[index_kafka_vrops_properties_ec5efd099c7bc54_dkniaoek, index_kafka_zabbix_04e7deda85b7b09_mejdnkkm, index_kafka_vrops_90fddd7dd00aec7_fjkjclgi, index_kafka_vrops_properties_0da27f4d850fe43_nhfdlkil, index_kafka_prom_lighthouse_94fbe74013e94c5_pmfeeebk, index_kafka_prom_lighthouse_080ca9811b1d2d4_aoiknogc, index_kafka_druid_metrics_8be3b0155a73ea0_fghicpfb, index_kafka_vrops_25dc2aa6be3a300_gohlcfap, index_kafka_vrops_6d765fcf8a23c1e_docphgjp, index_kafka_vrops_properties_4437f8b14c8540c_dkiifcef, index_kafka_druid_metrics_a0629263d9acc3d_acdkbfmc, index_kafka_prom_lighthouse_9fd06cdb17ad979_ghaijclo, index_kafka_srm_a1139b196797600_fckbfele, index_kafka_prom_lighthouse_97f51e2a06fadd0_eddjgnfl, index_kafka_srm_bbfd58d981cd115_pnhjpgjp, index_kafka_zabbix_b980b711cff794a_flbfkppl, index_kafka_zabbix_63dd19779ab98c3_plfmield, index_kafka_ecscap_542f2ab1f65d0f9_fmfpedff, index_kafka_srm_32a67a2905a0ef9_doedneah, index_kafka_vrops_properties_4437f8b14c8540c_apggadoc, index_kafka_srm_5eb82c47822b3d1_gfdgdohl, index_kafka_prom_lighthouse_a44e1099977e001_njccpjej, index_kafka_vrops_properties_6a0d97ed4d2e3e2_ccdemjjk, index_kafka_zabbix_ecc8882ac8a5e0b_anahjkgg, index_kafka_vrops_properties_4714686ef85e5f6_ajanocfp, index_kafka_ecscap_c79f891d46e541a_ppblajhf, index_kafka_vrops_properties_4714686ef85e5f6_ebbjffno]]]
2022-05-25T12:11:44,518 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ligplbyjchbtcgg.amer.replaceddomain.com:8091, status 200 OK, response: {"task":"index_kafka_vrops_a9f227794bcd31c_nohagige"}
2022-05-25T12:11:44,772 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_vrops_a9f227794bcd31c_nohagige] because: [task is not in knownTaskIds[[index_kafka_vrops_properties_ec5efd099c7bc54_dkniaoek, index_kafka_zabbix_04e7deda85b7b09_mejdnkkm, index_kafka_vrops_90fddd7dd00aec7_fjkjclgi, index_kafka_vrops_properties_0da27f4d850fe43_nhfdlkil, index_kafka_prom_lighthouse_94fbe74013e94c5_pmfeeebk, index_kafka_prom_lighthouse_080ca9811b1d2d4_aoiknogc, index_kafka_druid_metrics_8be3b0155a73ea0_fghicpfb, index_kafka_vrops_25dc2aa6be3a300_gohlcfap, index_kafka_vrops_6d765fcf8a23c1e_docphgjp, index_kafka_vrops_properties_4437f8b14c8540c_dkiifcef, index_kafka_druid_metrics_a0629263d9acc3d_acdkbfmc, index_kafka_prom_lighthouse_9fd06cdb17ad979_ghaijclo, index_kafka_srm_a1139b196797600_fckbfele, index_kafka_prom_lighthouse_97f51e2a06fadd0_eddjgnfl, index_kafka_srm_bbfd58d981cd115_pnhjpgjp, index_kafka_zabbix_b980b711cff794a_flbfkppl, index_kafka_zabbix_63dd19779ab98c3_plfmield, index_kafka_ecscap_542f2ab1f65d0f9_fmfpedff, index_kafka_srm_32a67a2905a0ef9_doedneah, index_kafka_srm_5eb82c47822b3d1_gfdgdohl, index_kafka_prom_lighthouse_a44e1099977e001_njccpjej, index_kafka_vrops_properties_76bf5c281336db3_lepnhmok, index_kafka_vrops_properties_6a0d97ed4d2e3e2_ccdemjjk, index_kafka_zabbix_ecc8882ac8a5e0b_anahjkgg, index_kafka_vrops_properties_4714686ef85e5f6_ajanocfp, index_kafka_ecscap_c79f891d46e541a_ppblajhf]]]
2022-05-25T12:11:44,775 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ligplbyjchbtcgg.amer.replaceddomain.com:8091, status 200 OK, response: {"task":"index_kafka_vrops_a9f227794bcd31c_nohagige"}
2022-05-25T12:11:45,058 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_vrops_a9f227794bcd31c_nohagige] because: [task is not in knownTaskIds[[index_kafka_zabbix_04e7deda85b7b09_mejdnkkm, index_kafka_vrops_90fddd7dd00aec7_fjkjclgi, index_kafka_vrops_properties_0da27f4d850fe43_nhfdlkil, index_kafka_prom_lighthouse_94fbe74013e94c5_pmfeeebk, index_kafka_prom_lighthouse_080ca9811b1d2d4_aoiknogc, index_kafka_druid_metrics_8be3b0155a73ea0_fghicpfb, index_kafka_vrops_25dc2aa6be3a300_gohlcfap, index_kafka_vrops_6d765fcf8a23c1e_docphgjp, index_kafka_vrops_properties_4437f8b14c8540c_dkiifcef, index_kafka_druid_metrics_a0629263d9acc3d_acdkbfmc, index_kafka_prom_lighthouse_9fd06cdb17ad979_ghaijclo, index_kafka_srm_a1139b196797600_fckbfele, index_kafka_prom_lighthouse_97f51e2a06fadd0_eddjgnfl, index_kafka_srm_bbfd58d981cd115_pnhjpgjp, index_kafka_zabbix_b980b711cff794a_flbfkppl, index_kafka_zabbix_63dd19779ab98c3_plfmield, index_kafka_srm_32a67a2905a0ef9_doedneah, index_kafka_srm_5eb82c47822b3d1_gfdgdohl, index_kafka_vrops_properties_ec5efd099c7bc54_hnhdnfkp, index_kafka_prom_lighthouse_a44e1099977e001_njccpjej, index_kafka_vrops_properties_76bf5c281336db3_lepnhmok, index_kafka_vrops_properties_6a0d97ed4d2e3e2_ccdemjjk, index_kafka_zabbix_ecc8882ac8a5e0b_anahjkgg, index_kafka_vrops_properties_4714686ef85e5f6_ajanocfp, index_kafka_ecscap_c79f891d46e541a_ppblajhf]]]
2022-05-25T12:11:45,067 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ligplbyjchbtcgg.amer.replaceddomain.com:8091, status 200 OK, response: {"task":"index_kafka_vrops_a9f227794bcd31c_nohagige"}
2022-05-25T12:11:45,374 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_vrops_a9f227794bcd31c_nohagige] because: [task is not in knownTaskIds[[index_kafka_zabbix_b980b711cff794a_flbfkppl, index_kafka_zabbix_04e7deda85b7b09_mejdnkkm, index_kafka_vrops_90fddd7dd00aec7_fjkjclgi, index_kafka_zabbix_63dd19779ab98c3_plfmield, index_kafka_vrops_properties_0da27f4d850fe43_nhfdlkil, index_kafka_prom_lighthouse_94fbe74013e94c5_pmfeeebk, index_kafka_prom_lighthouse_080ca9811b1d2d4_aoiknogc, index_kafka_druid_metrics_8be3b0155a73ea0_fghicpfb, index_kafka_vrops_25dc2aa6be3a300_gohlcfap, index_kafka_vrops_6d765fcf8a23c1e_docphgjp, index_kafka_vrops_properties_4437f8b14c8540c_dkiifcef, index_kafka_druid_metrics_a0629263d9acc3d_acdkbfmc, index_kafka_srm_5eb82c47822b3d1_gfdgdohl, index_kafka_vrops_properties_ec5efd099c7bc54_hnhdnfkp, index_kafka_prom_lighthouse_9fd06cdb17ad979_ghaijclo, index_kafka_prom_lighthouse_a44e1099977e001_njccpjej, index_kafka_srm_a1139b196797600_fckbfele, index_kafka_vrops_properties_76bf5c281336db3_lepnhmok, index_kafka_vrops_properties_6a0d97ed4d2e3e2_ccdemjjk, index_kafka_zabbix_ecc8882ac8a5e0b_anahjkgg, index_kafka_vrops_properties_4714686ef85e5f6_ajanocfp, index_kafka_prom_lighthouse_97f51e2a06fadd0_eddjgnfl, index_kafka_ecscap_c79f891d46e541a_ppblajhf]]]
2022-05-25T12:11:45,379 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ligplbyjchbtcgg.amer.replaceddomain.com:8091, status 200 OK, response: {"task":"index_kafka_vrops_a9f227794bcd31c_nohagige"}
2022-05-25T12:11:45,634 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_vrops_a9f227794bcd31c_nohagige] because: [task is not in knownTaskIds[[index_kafka_zabbix_b980b711cff794a_flbfkppl, index_kafka_zabbix_04e7deda85b7b09_mejdnkkm, index_kafka_vrops_90fddd7dd00aec7_fjkjclgi, index_kafka_zabbix_63dd19779ab98c3_plfmield, index_kafka_vrops_properties_0da27f4d850fe43_nhfdlkil, index_kafka_prom_lighthouse_94fbe74013e94c5_pmfeeebk, index_kafka_prom_lighthouse_080ca9811b1d2d4_aoiknogc, index_kafka_vrops_25dc2aa6be3a300_gohlcfap, index_kafka_vrops_6d765fcf8a23c1e_docphgjp, index_kafka_vrops_properties_4437f8b14c8540c_dkiifcef, index_kafka_druid_metrics_a0629263d9acc3d_acdkbfmc, index_kafka_srm_5eb82c47822b3d1_gfdgdohl, index_kafka_vrops_properties_ec5efd099c7bc54_hnhdnfkp, index_kafka_prom_lighthouse_9fd06cdb17ad979_ghaijclo, index_kafka_prom_lighthouse_a44e1099977e001_njccpjej, index_kafka_srm_a1139b196797600_fckbfele, index_kafka_vrops_properties_76bf5c281336db3_lepnhmok, index_kafka_vrops_properties_6a0d97ed4d2e3e2_ccdemjjk, index_kafka_zabbix_ecc8882ac8a5e0b_anahjkgg, index_kafka_vrops_properties_4714686ef85e5f6_ajanocfp, index_kafka_prom_lighthouse_97f51e2a06fadd0_eddjgnfl, index_kafka_ecscap_c79f891d46e541a_ppblajhf]]]
2022-05-25T12:11:45,638 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: ligplbyjchbtcgg.amer.replaceddomain.com:8091, status 200 OK, response: {"task":"index_kafka_vrops_a9f227794bcd31c_nohagige"}
2022-05-25T12:11:45,687 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Worker[ligplbyjchbtcgg.amer.replaceddomain.com:8091] wrote SUCCESS status for task [index_kafka_vrops_a9f227794bcd31c_nohagige] on [TaskLocation{host='ligplbyjchbtcgg.amer.replaceddomain.com', port=8101, tlsPort=-1}]
2022-05-25T12:11:45,687 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Worker[ligplbyjchbtcgg.amer.replaceddomain.com:8091] completed task[index_kafka_vrops_a9f227794bcd31c_nohagige] with status[SUCCESS]
2022-05-25T12:11:45,687 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.TaskQueue - Received SUCCESS status for task: index_kafka_vrops_a9f227794bcd31c_nohagige
2022-05-25T12:11:45,687 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_vrops_a9f227794bcd31c_nohagige] because: [notified status change from task]
2022-05-25T12:11:45,687 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Cleaning up task[index_kafka_vrops_a9f227794bcd31c_nohagige] on worker[ligplbyjchbtcgg.amer.replaceddomain.com:8091]
2022-05-25T12:11:45,690 WARN [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.TaskQueue - Unknown task completed: index_kafka_vrops_a9f227794bcd31c_nohagige
2022-05-25T12:11:45,690 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.TaskQueue - Task SUCCESS: AbstractTask{id='index_kafka_vrops_a9f227794bcd31c_nohagige', groupId='index_kafka_vrops', taskResource=TaskResource{availabilityGroup='index_kafka_vrops_a9f227794bcd31c', requiredCapacity=1}, dataSource='vrops', context={checkpoints={"0":{"4":150881239982}}, useLineageBasedSegmentAllocation=true, IS_INCREMENTAL_HANDOFF_SUPPORTED=true, forceTimeChunkLock=true}} (4086059 run duration)
2022-05-25T12:11:45,690 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Task[index_kafka_vrops_a9f227794bcd31c_nohagige] went bye bye.
Based on the logs, it appears the Overlord did not recognize the Peon's SUCCESS status in time (or did not poll for it in time; I'm unsure how a SUCCESS status is passed from a Peon to the Overlord). The Overlord does eventually receive the SUCCESS status, but only after it has tried to shut the task down multiple times, and by then it appears to have already marked the task as FAILED.
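To put rough numbers on that timing, here is a small Python sketch that measures how long after the Peon's lifecycle stop each Overlord event happened. The timestamps are copied from the log lines above; the event names and the helper functions (`seconds_between`, `gaps_from`) are labels I made up for illustration, not anything from Druid.

```python
from datetime import datetime

# Timestamp layout used in the Druid log lines above (comma before the millis).
FMT = "%Y-%m-%dT%H:%M:%S,%f"

# Timestamps copied from the logs; the keys are hypothetical labels.
events = {
    "peon_lifecycle_stopping": "2022-05-25T12:11:01,199",
    "overlord_first_pause_refused": "2022-05-25T12:11:13,393",
    "overlord_marks_failed": "2022-05-25T12:11:40,416",
    "overlord_receives_success": "2022-05-25T12:11:45,687",
}


def seconds_between(start, end):
    """Return the elapsed seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


def gaps_from(anchor, timeline):
    """Return each event's offset in seconds from the anchor event."""
    return {
        name: seconds_between(timeline[anchor], ts)
        for name, ts in timeline.items()
        if name != anchor
    }


gaps = gaps_from("peon_lifecycle_stopping", events)
for name, gap in gaps.items():
    print(f"{name}: +{gap:.3f}s")
```

If the arithmetic is right, the FAILED status is written roughly 39 seconds after the Peon stops, and the SUCCESS notification only reaches the Overlord about 5 seconds after that.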
Can someone help me understand how a Peon's SUCCESS status is communicated to the Overlord, why this is happening, and how we can resolve it? We currently have taskDuration set to 1 hour and completionTimeout set to 30 minutes. We did try adjusting completionTimeout in the past, but I believe these issues persisted.
Thanks,
Peter