Task Heartbeating?

All,

In one of our environments, we had a middle manager run out of disk space. Even after the tasks on that middle manager died, we continued seeing the tasks in the coordinator console. I don’t believe this is the first time we’ve seen zombie tasks in the console.

So —

In general, is there a task heart-beating mechanism of some sort?

How do the overlord and coordinator recognize when a task is gone?

(or do we need to write some monitoring/healthcheck mechanism to monitor tasks, and manually fail them when they die?)

thanks (again) in advance,

-brian

Hey Brian,

The middleManager (MM) process heartbeats on behalf of its own tasks through ZooKeeper. If the MM process dies, the Overlord waits for a grace period to see whether it comes back (you might have just restarted it, or there might have been a network glitch). Once that grace period is over, the Overlord fails all of that MM's tasks at once. The grace period is set by druid.indexer.runner.taskCleanupTimeout (see http://druid.io/docs/latest/configuration/indexing-service.html).
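
As a rough sketch, shortening the grace period would look something like the line below in the Overlord's runtime.properties. The PT5M value is only an illustration (not a recommendation), and I'm assuming the usual ISO 8601 duration format here, so check the docs for your version before relying on it:

```properties
# Overlord runtime.properties (illustrative value, not a recommendation)
# How long the Overlord waits after an MM disappears from ZooKeeper
# before failing that MM's tasks.
druid.indexer.runner.taskCleanupTimeout=PT5M
```

A shorter timeout clears zombie tasks sooner, but it also gives a restarted MM less time to reconnect before its tasks are failed.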