Hi Shailendra,
> what is the permanent fixes here?
We need to understand the root cause of this issue in order to get to a permanent solution.
> How did it happen?? Every time it happened, I fixed it by HARD_RESET in the action column.
The supervisor state UNHEALTHY_TASKS means the last druid.supervisor.taskUnhealthinessThreshold tasks have all failed. "druid.supervisor.taskUnhealthinessThreshold" is the number of consecutive task failures before the supervisor is considered unhealthy. Its default value is 3, so three consecutive task failures will put the supervisor into this state.
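For reference, this threshold is set in the Overlord's runtime.properties. A minimal sketch (the value shown is the default, not a recommendation):

```
# Overlord runtime.properties
# Number of consecutive task failures before the supervisor is marked UNHEALTHY_TASKS
druid.supervisor.taskUnhealthinessThreshold=3
```

Raising it only hides the symptom, though; the failing tasks still need to be investigated.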
Why are the tasks failing?
Looking at the log excerpts you have posted, it seems the Kafka tasks are failing while trying to publish their segments:
2021-04-08T09:25:51,474 WARN [KafkaSupervisor-mriprodstream] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - All tasks in group [0] failed to publish, killing all tasks for these partitions
- Looking at the overlord log carefully should give you more details on exactly why the tasks are failing while publishing.
- One reason could be that the tasks are timing out on completionTimeout, the length of time to wait before declaring a publishing task failed and terminating it. If this is set too low, your tasks may never publish. The publishing clock for a task begins roughly after taskDuration elapses. The default value for completionTimeout is PT30M (30 minutes). If that is the case, you may see messages like the following in the overlord log:
No task in [] succeeded before the completion timeout elapsed [PT1800S]!
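If completionTimeout is the culprit, you can raise it in the ioConfig section of your Kafka supervisor spec. A partial sketch only; the PT1H values are illustrative, not a recommendation, and the rest of the spec is omitted:

```
{
  "type": "kafka",
  "ioConfig": {
    "topic": "mriprodstream",
    "taskDuration": "PT1H",
    "completionTimeout": "PT1H"
  }
}
```

A reasonable rule of thumb is to keep completionTimeout at least as large as the time your tasks actually need to publish, which you can estimate from the task logs.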
You can grep for the failed task IDs in the overlord log to learn more about the failure, which will help you identify the next steps.
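For example, something like the following (the task ID and log path are hypothetical; substitute a real failed task ID from the Druid console and your actual overlord log location). The sample log file here just stands in for your real overlord.log so the commands are runnable as-is:

```shell
# Stand-in for your real overlord log (e.g. /var/log/druid/overlord.log)
cat > /tmp/overlord-sample.log <<'EOF'
2021-04-08T09:25:51,474 INFO TaskQueue - Task done: index_kafka_mriprodstream_abc123
2021-04-08T09:25:51,475 WARN TaskQueue - Task FAILED: index_kafka_mriprodstream_abc123
EOF

# Print every overlord log line mentioning the failed task id
grep "index_kafka_mriprodstream_abc123" /tmp/overlord-sample.log
```

On a real deployment you would grep the actual overlord log, and possibly pipe through `tail` since busy logs can mention a task ID thousands of times.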
Thanks and Regards,
Vaibhav