Druid Ingestion Tasks Stuck in Pending State

Hi,

I am using Druid version 0.17.0 on a single-host, Docker-based system for test purposes. The system ran fine for 2 weeks and handled both streaming and batch loads until very recently. Now all new tasks end up in PENDING status and make no progress. Both streaming and batch ingestion tasks are affected.

In the Overlord process logs I see the following:

2020-04-03T09:10:48,595 INFO [qtp890160784-67] org.apache.druid.indexing.overlord.MetadataTaskStorage - Inserting task index_parallel_XXXXX_pjglcghe_2020-04-03T09:10:48.594Z with status: TaskStatus{id=index_parallel_XXXXX_pjglcghe_2020-04-03T09:10:48.594Z, status=RUNNING, duration=-1, errorMsg=null}

2020-04-03T09:10:48,597 INFO [qtp890160784-67] org.apache.druid.indexing.overlord.TaskLockbox - Adding task[index_parallel_XXXXX_pjglcghe_2020-04-03T09:10:48.594Z] to activeTasks

2020-04-03T09:10:48,598 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.TaskQueue - Asking taskRunner to run: index_parallel_XXXXX_pjglcghe_2020-04-03T09:10:48.594Z

2020-04-03T09:10:48,598 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Added pending task index_parallel_XXXXX_pjglcghe_2020-04-03T09:10:48.594Z

2020-04-03T09:10:54,916 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Assigned a task[index_parallel_XXXXX_pjglcghe_2020-04-03T09:10:48.594Z] that is already pending!

2020-04-03T09:11:54,916 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.RemoteTaskRunner - Assigned a task[index_parallel_XXXXX_pjglcghe_2020-04-03T09:10:48.594Z] that is already pending!

This particular log - Assigned a task[index_parallel_XXXXX_pjglcghe_2020-04-03T09:10:48.594Z] that is already pending - keeps repeating in the log file but no progress is being made. There are 5 tasks in this state (3 streaming tasks and 2 batch ingestion tasks).

Any help with the following questions would be highly appreciated:

  1. What may be causing this issue? The tasks have been in pending state for more than 10 hours now. Any new tasks end up in a similar state.

  2. How can I get the system back to the working state? Is it safe to restart the processes in this state? Which processes need to be restarted?

  3. Is it safe to kill these tasks through REST API calls?

Thanks,

Shashi

Hi Shashi,

  1. What may be causing this issue? The tasks have been in pending state for more than 10 hours now. Any new tasks end up in a similar state.

This can happen when the workers have no free capacity, when there are not enough workers, or when the pending task keeps waiting for locks held by other tasks. What is druid.worker.capacity set to in your MiddleManager/Indexer config?
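To check the capacity theory, you can ask the Overlord which workers it knows about and how many task slots each one has left. Below is a rough Python sketch, not a definitive tool. The base URL is an assumption (adjust host and port to your Docker setup), and the field names `worker.capacity` and `currCapacityUsed` are what I understand the `/druid/indexer/v1/workers` response to contain, so please verify them against the output of your version.

```python
import json
from urllib.request import urlopen

# Assumption: the Overlord is reachable here. Change this for your setup.
OVERLORD = "http://localhost:8090"


def free_slots(workers):
    """Map each worker host to its number of free task slots.

    `workers` is the parsed JSON list from /druid/indexer/v1/workers.
    Field names are assumed; check them against your Druid version.
    """
    result = {}
    for entry in workers:
        worker = entry["worker"]
        result[worker["host"]] = worker["capacity"] - entry["currCapacityUsed"]
    return result


def fetch_json(path, base=OVERLORD):
    """GET a path on the Overlord and return the decoded JSON body."""
    with urlopen(base + path, timeout=10) as resp:
        return json.load(resp)


def report(base=OVERLORD):
    """Print free slots per worker and the number of pending tasks."""
    workers = fetch_json("/druid/indexer/v1/workers", base)
    pending = fetch_json("/druid/indexer/v1/pendingTasks", base)
    for host, free in free_slots(workers).items():
        print(f"{host}: {free} free slot(s)")
    print(f"pending tasks: {len(pending)}")
```

Calling `report()` against a live Overlord prints one line per worker. If the worker list is empty, or every worker shows 0 free slots while tasks are pending, capacity is the likely culprit rather than locks.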

  2. How can I get the system back to the working state? Is it safe to restart the processes in this state? Which processes need to be restarted?

You could try restarting the MiddleManager and Overlord processes.

  3. Is it safe to kill these tasks through REST API calls?

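I think it’s safe to kill these tasks. The streaming ones should be recreated by the supervisor; the batch ones are not managed by a supervisor, so you would need to resubmit them.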
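If you go the REST route, the Overlord accepts a POST to `/druid/indexer/v1/task/{taskId}/shutdown`. Here is a minimal Python sketch of that call. The base URL is an assumption for your environment, and the helper names `shutdown_url` and `shutdown_task` are just made up for this example.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumption: the Overlord is reachable here. Change this for your setup.
OVERLORD = "http://localhost:8090"


def shutdown_url(task_id, base=OVERLORD):
    """Build the shutdown URL for one task, percent-encoding the task ID."""
    return f"{base}/druid/indexer/v1/task/{quote(task_id, safe='')}/shutdown"


def shutdown_task(task_id, base=OVERLORD):
    """POST a shutdown request for the task and return the HTTP status code."""
    req = Request(shutdown_url(task_id, base), data=b"", method="POST")
    with urlopen(req, timeout=10) as resp:
        return resp.status
```

You would call `shutdown_task(...)` once for each of the stuck task IDs shown in the Overlord log. The task ID is percent-encoded because it contains colons from the timestamp.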