Re: [druid-user] supervisors status change to unhealty_tasks

Hi Shailendra,

> what is the permanent fixes here?

We need to understand the root cause for this issue in order to get to a permanent solution.

> How did it happen? Every time it happened, I fixed it by doing a HARD_RESET in the action column.

A supervisor status of UNHEALTHY_TASKS means that the last "druid.supervisor.taskUnhealthinessThreshold" tasks have all failed. "druid.supervisor.taskUnhealthinessThreshold" is the number of consecutive task failures before the supervisor is considered unhealthy. The default value for this parameter is 3, so three consecutive task failures will put the supervisor into this state.
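For reference, this threshold is an overlord-side setting in runtime.properties. The value shown below is just the default, not a recommendation:

```properties
# Overlord runtime.properties (druid.supervisor.* settings live on the overlord).
# Number of consecutive failed tasks before a supervisor reports UNHEALTHY_TASKS.
druid.supervisor.taskUnhealthinessThreshold=3
```

Raising this only hides the symptom, though; the failing tasks still need a root cause.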

Why are the tasks failing?

Looking at the log excerpts you have posted, it seems the Kafka tasks are failing while trying to publish their segments.

2021-04-08T09:25:51,474 WARN [KafkaSupervisor-mriprodstream] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - All tasks in group [0] failed to publish, killing all tasks for these partitions

  • Reading the overlord log carefully should give you more details on the exact cause of the publishing failures.
  • One possible reason is that the task is timing out on completionTimeout - the length of time to wait before declaring a publishing task as failed and terminating it. If this is set too low, your tasks may never publish. The publishing clock for a task begins roughly after taskDuration elapses. The default value for completionTimeout is 30 minutes (PT30M). If that is the case, you may see the following message in the overlord log -
    No task in [] succeeded before the completion timeout elapsed [PT1800S]!
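If publishing really is timing out, one mitigation is to raise completionTimeout in the supervisor spec's ioConfig. The fragment below is illustrative only - the topic name is a placeholder, and the durations are examples, not recommendations:

```json
{
  "type": "kafka",
  "ioConfig": {
    "topic": "your-topic",
    "taskDuration": "PT1H",
    "completionTimeout": "PT1H"
  }
}
```

After editing the spec, resubmit it to the overlord so that newly created tasks pick up the change; already-running tasks keep their old settings.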

You can grep for the failed task_ids in the overlord log to learn more about the failure which will help you in identifying the next steps.
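As a rough sketch of that grep workflow - note the log lines below are synthetic stand-ins I made up for the example, not real overlord output, and the real log path depends on your deployment:

```shell
# Create a tiny sample "overlord log" so the commands below are runnable.
# In practice, point LOG at your actual overlord log file instead.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2021-04-08T09:20:00,001 INFO TaskQueue - Task added: index_kafka_mriprodstream_abc_1
2021-04-08T09:25:51,480 INFO TaskQueue - Task FAILED: index_kafka_mriprodstream_abc_1
2021-04-08T09:26:00,001 INFO TaskQueue - Task SUCCESS: index_kafka_mriprodstream_def_2
EOF

# Step 1: pull out the ids of tasks mentioned on failure lines.
grep 'FAILED' "$LOG" | grep -oE 'index_kafka_[A-Za-z0-9_]+'

# Step 2: grep for one failed task id to see its full history in the log.
grep 'index_kafka_mriprodstream_abc_1' "$LOG"
```

The exception or error logged around those lines usually points at the next step (timeout, metadata store problem, deep storage problem, etc.).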

Thanks and Regards,
Vaibhav