In one of our environments, we had a middle manager run out of disk space. Even after the tasks on that middle manager died, we continued seeing the tasks in the coordinator console. I don’t believe this is the first time we’ve seen zombie tasks in the console.
In general, is there a task heart-beating mechanism of some sort?
How do the overlord and coordinator recognize when a task is gone?
(Or do we need to write some monitoring/healthcheck mechanism to monitor tasks, and manually fail them when they die?)
Thanks (again) in advance,