[druid-user] Supervisor tasks kept running even after Supervisor termination

Hi,
I am using Druid version 0.21.0 and facing an issue with the supervisor's Kafka indexing tasks: the tasks keep running even after the supervisor is terminated and the datasource is deleted. I have also deleted the Druid metadata from SQL Server and restarted the pods, but the tasks keep getting rescheduled.

When I analysed the logs, I found that the Overlord assigns the task and then removes it from activeTasks every minute, which ends with a JVM shutdown for the running task.

Below are some relevant logs.

Overlord Logs
2022-03-11T06:35:14,890 INFO [KafkaSupervisor-events_datasource] org.apache.druid.indexing.overlord.MetadataTaskStorage - Inserting task index_kafka_events_datasource with status: TaskStatus{id=index_kafka_events_datasource, status=RUNNING, duration=-1, errorMsg=null}
2022-03-11T06:35:14,901 INFO [KafkaSupervisor-events_datasource] org.apache.druid.indexing.overlord.TaskLockbox - Adding task[index_kafka_events_datasource] to activeTasks
2022-03-11T06:35:14,902 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.TaskQueue - Asking taskRunner to run: index_kafka_events_datasource
2022-03-11T06:35:14,902 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Added pending task index_kafka_events_datasource
2022-03-11T06:35:14,902 INFO [rtr-pending-tasks-runner-0] org.apache.druid.indexing.overlord.RemoteTaskRunner - Coordinator asking Worker[10.244.13.154:8091] to add task[index_kafka_events_datasource]
2022-03-11T06:35:14,906 INFO [rtr-pending-tasks-runner-0] org.apache.druid.indexing.overlord.RemoteTaskRunner - Task index_kafka_events_datasource switched from pending to running (on [10.244.13.154:8091])
2022-03-11T06:35:14,920 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Worker[10.244.13.154:8091] wrote RUNNING status for task [index_kafka_events_datasource] on [TaskLocation{host='null', port=-1, tlsPort=-1}]
2022-03-11T06:35:14,923 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Worker[10.244.13.154:8091] wrote RUNNING status for task [index_kafka_events_datasource] on [TaskLocation{host='10.244.13.154', port=8100, tlsPort=-1}]
2022-03-11T06:35:31,118 INFO [KafkaSupervisor-events_datasource] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - {id='events_datasource', generationTime=2022-03-11T06:35:31.118Z, payload=KafkaSupervisorReportPayload{dataSource='events_datasource', topic='events_datasource_topic', partitions=1, replicas=1, durationSeconds=3600, active=[{id='index_kafka_events_datasource', startTime=2022-03-11T06:35:23.416Z, remainingSeconds=3592}], publishing=, suspended=false, healthy=false, state=UNHEALTHY_TASKS, detailedState=UNHEALTHY_TASKS, recentErrors=}}
2022-03-11T06:35:31,160 INFO [KafkaSupervisor-events_datasource] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - {id='events_datasource', generationTime=2022-03-11T06:35:31.160Z, payload=KafkaSupervisorReportPayload{dataSource='events_datasource', topic='events_datasource_topic', partitions=1, replicas=1, durationSeconds=3600, active=[{id='index_kafka_events_datasource', startTime=2022-03-11T06:35:23.416Z, remainingSeconds=3592}], publishing=, suspended=false, healthy=false, state=UNHEALTHY_TASKS, detailedState=UNHEALTHY_TASKS, recentErrors=}}

ERROR in Overlord: removing the task from activeTasks
2022-03-11T06:36:00,094 INFO [TaskQueue-StorageSync] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[index_kafka_events_datasource] from activeTasks
2022-03-11T06:36:00,094 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_skills]]]
2022-03-11T06:36:00,095 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: 10.244.13.154:8091, status 200 OK, response: {"task":"index_kafka_events_datasource"}
2022-03-11T06:36:00,146 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_events_datasource] because: [task is not in knownTaskIds[[index_kafka_skills]]]
2022-03-11T06:36:00,147 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: 10.244.13.154:8091, status 200 OK, response: {"task":"index_kafka_events_datasource"}
2022-03-11T06:36:00,494 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Worker[10.244.13.154:8091] wrote FAILED status for task [index_kafka_events_datasource] on [TaskLocation{host='10.214.13.134', port=8100, tlsPort=-1}]
2022-03-11T06:36:00,494 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Worker[10.244.13.154:8091] completed task[index_kafka_events_datasource] with status[FAILED]
2022-03-11T06:36:00,494 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.TaskQueue - Received FAILED status for task:
2022-03-11T06:36:00,494 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown [index_kafka_events_datasource] because: [notified status change from task]
2022-03-11T06:36:00,494 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Cleaning up task[index_kafka_events_datasource] on worker[10.244.13.154:8091]
2022-03-11T06:36:00,496 WARN [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.TaskQueue - Unknown task completed: index_kafka_events_datasource
2022-03-11T06:36:00,496 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.TaskQueue - Task FAILED: AbstractTask{id='index_kafka_events_datasource', groupId='index_kafka_events_datasource', taskResource=TaskResource{availabilityGroup='index_kafka_events_datasource_30c2b50db08af1a', requiredCapacity=1}, dataSource='events_datasource', context={forceTimeChunkLock=true, checkpoints={"0":{"0":126063}}, IS_INCREMENTAL_HANDOFF_SUPPORTED=true}} (45572 run duration)
2022-03-11T06:36:00,496 INFO [Curator-PathChildrenCache-1] org.apache.druid.indexing.overlord.RemoteTaskRunner - Task[index_kafka_events_datasource] went bye bye.

MiddleManager Logs

Corresponding ERROR in MiddleManager

2022-03-11T06:36:00,095 INFO [qtp1365833373-100] org.apache.druid.indexing.overlord.ForkingTaskRunner - Shutdown [index_kafka_events_datasource] because: [shut down request via HTTP endpoint]
2022-03-11T06:36:00,095 INFO [qtp1365833373-100] org.apache.druid.indexing.overlord.ForkingTaskRunner - Closing output stream to task[index_kafka_events_datasource].
2022-03-11T06:36:00,147 INFO [qtp1365833373-124] org.apache.druid.indexing.overlord.ForkingTaskRunner - Shutdown [index_kafka_events_datasource] because: [shut down request via HTTP endpoint]
2022-03-11T06:36:00,147 INFO [qtp1365833373-124] org.apache.druid.indexing.overlord.ForkingTaskRunner - Closing output stream to task[index_kafka_events_datasource].
2022-03-11T06:36:00,231 INFO [forking-task-runner-2-[index_kafka_events_datasource]] org.apache.druid.indexing.overlord.ForkingTaskRunner - Process exited with status[2] for task: index_kafka_events_datasource
2022-03-11T06:36:00,232 INFO [forking-task-runner-2] com.cloud.bots.druid.storage.OciTaskLogs - Pushing task log var/druid/task/index_kafka_events_datasource/log to druid/indexer-logs/index_kafka_events_datasource/log
2022-03-11T06:36:00,365 INFO [forking-task-runner-2] com.cloud.bots.druid.storage.OciTaskLogs - Pushing task reports var/druid/task/index_kafka_events_datasource/990d3707-37af-4f30-b98f-bcfd9ffaf12b/report.json to druid/indexer-logs/index_kafka_events_datasource/report.json
2022-03-11T06:36:00,485 INFO [forking-task-runner-2] org.apache.druid.indexing.overlord.ForkingTaskRunner - Removing task directory: var/druid/task/index_kafka_events_datasource
2022-03-11T06:36:00,493 INFO [WorkerTaskManager-NoticeHandler] org.apache.druid.indexing.worker.WorkerTaskManager - Task [index_kafka_events_datasource] completed with status [FAILED].

Corresponding issue in the Kafka index task logs

2022-03-11T06:36:00,095 INFO [parent-monitor-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Triggering JVM shutdown.
2022-03-11T06:36:00,096 INFO [Thread-69] org.apache.druid.cli.CliPeon - Running shutdown hook
2022-03-11T06:36:00,098 INFO [Thread-69] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [ANNOUNCEMENTS]

I'm not sure why the task keeps getting created again and again when the supervisor is suspended/terminated. Meanwhile, the Overlord removes the task from the active task list even while the task is running, which causes the task to be terminated.
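For anyone trying to clean this up by hand, the Overlord REST endpoints for terminating a supervisor and force-stopping a task can be sketched roughly like this (the host/port and the use of plain urllib are my assumptions, not something from the logs above):

```python
import json
import urllib.request

OVERLORD = "http://localhost:8090"  # assumption: your Overlord host/port

def supervisor_terminate_url(supervisor_id):
    # Terminates the supervisor and tombstones its spec in metadata.
    return f"{OVERLORD}/druid/indexer/v1/supervisor/{supervisor_id}/terminate"

def task_shutdown_url(task_id):
    # Asks the Overlord to shut down a single task by id.
    return f"{OVERLORD}/druid/indexer/v1/task/{task_id}/shutdown"

def post(url):
    # Both endpoints above are POST with an empty body.
    req = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Requires a live Overlord, so left commented out:
# post(supervisor_terminate_url("events_datasource"))
# post(task_shutdown_url("index_kafka_events_datasource"))
```

In theory, shutting the supervisor down first should stop it from respawning the task the moment you kill it.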

Waiting for valuable suggestions.

Thanks
Kuldeep Gaur

That sounds messed up. In the absence of better advice, have you tried restarting the coordinator(s)?

Yes, I tried that. Tasks are still getting created even without the supervisors.

Just as an update,

I tried terminating the datasource along with the supervisors, deleted the metadata stored in SQL, and restarted the whole system; the tasks are still getting scheduled without the supervisor and datasource.
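In case it helps anyone compare notes, these are the metadata tables where I would expect leftover supervisor/task state to survive a restart (this list is based on my understanding of the default schema, assuming the default `druid_` table base name):

```python
# Metadata tables where stale supervisor/task state can survive a restart.
# Assumption: default table base name ("druid"), i.e. druid_tasks etc.
TABLES_TO_CHECK = [
    "druid_supervisors",      # supervisor specs; terminated specs are tombstoned here
    "druid_tasks",            # task entries the Overlord re-reads on startup
    "druid_tasklocks",        # locks held by active tasks
    "druid_pendingSegments",  # segment allocations made by the tasks
]

def count_query(table):
    # Simple sanity query to see whether a table still has rows.
    return f"SELECT COUNT(*) FROM {table}"

for table in TABLES_TO_CHECK:
    print(count_query(table))
```

If rows reappear in `druid_tasks` after you truncate them and restart, something is still writing them.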

Thanks

Just a thought: could there be a conflict among multiple clusters using the same resources, i.e. the same metadata DB or the same ZooKeeper config in more than one cluster?
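If that turns out to be the case, the settings that have to be unique per cluster are roughly these (a sketch of the relevant `runtime.properties` keys; the hostnames and paths below are placeholders, not your actual values):

```properties
# Each cluster needs its own ZooKeeper chroot path...
druid.zk.service.host=zk.example.com:2181
druid.zk.paths.base=/druid/cluster-a

# ...and its own metadata database (or at least its own table base name)
druid.metadata.storage.connector.connectURI=jdbc:mysql://db.example.com:3306/druid_cluster_a
```

Two clusters sharing either of these could explain one cluster's Overlord killing tasks it doesn't recognise.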

Hi, have you found the root cause of this problem? We are seeing something similar, where the coordinator starts the ingestion tasks and kills them within a minute or two with a "because: [task is not in knownTaskIds]" error. Any pointers would be super helpful.