Task Heartbeating?


In one of our environments, we had a middle manager run out of disk space. Even after the tasks on that middle manager died, we continued seeing the tasks in the coordinator console. I don’t believe this is the first time we’ve seen zombie tasks in the console.

So —

In general, is there a task heartbeating mechanism of some sort?

How do the overlord and coordinator recognize when a task is gone?

(or do we need to write some monitoring/healthcheck mechanism to monitor tasks, and manually fail them when they die?)

thanks (again) in advance,


Hey Brian,

The middleManager process heartbeats on behalf of its own tasks through ZooKeeper. If the MM process dies, the Overlord waits out a grace period for it to come back (you might have just restarted it, or there might have been a network glitch). Once that grace period is over, all of that MM's tasks fail simultaneously. The grace period is set by druid.indexer.runner.taskCleanupTimeout (see http://druid.io/docs/latest/configuration/indexing-service.html).
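To make the grace-period behavior concrete, here is a rough toy model in Python. It is only a sketch of the logic described above, not Druid's actual code: the names `Worker`, `ToyOverlord`, `heartbeat_lost`, `heartbeat_seen`, and `sweep` are all made up for illustration, and the 15-minute value is just a placeholder for whatever taskCleanupTimeout is set to.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class Worker:
    """Hypothetical stand-in for a middleManager as the Overlord sees it."""
    tasks: List[str]
    # None while the heartbeat is present; otherwise the time it went missing.
    disconnected_at: Optional[datetime] = None


@dataclass
class ToyOverlord:
    # Plays the role of druid.indexer.runner.taskCleanupTimeout
    # (illustrative value, not read from any real config).
    cleanup_timeout: timedelta = timedelta(minutes=15)
    workers: Dict[str, Worker] = field(default_factory=dict)
    failed: List[str] = field(default_factory=list)

    def heartbeat_lost(self, name: str, now: datetime) -> None:
        """Start the grace period the first time a worker's heartbeat vanishes."""
        worker = self.workers[name]
        if worker.disconnected_at is None:
            worker.disconnected_at = now

    def heartbeat_seen(self, name: str) -> None:
        """The worker came back in time, so cancel the grace period."""
        self.workers[name].disconnected_at = None

    def sweep(self, now: datetime) -> None:
        """Fail every task of any worker whose grace period has expired."""
        for name in list(self.workers):
            worker = self.workers[name]
            if worker.disconnected_at is None:
                continue
            if now - worker.disconnected_at >= self.cleanup_timeout:
                self.failed.extend(worker.tasks)  # all tasks fail together
                del self.workers[name]


t0 = datetime(2024, 1, 1, 12, 0)
overlord = ToyOverlord(workers={"mm-1": Worker(tasks=["task_a", "task_b"])})
overlord.heartbeat_lost("mm-1", t0)
overlord.sweep(t0 + timedelta(minutes=5))   # still inside the grace period
overlord.sweep(t0 + timedelta(minutes=15))  # grace period over, tasks failed
```

Note that in this model nothing is marked failed until the timeout elapses, which matches the window during which tasks on a dead MM can still show up in the console.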