All tasks are shut down after zookeeper failure

Hi all,
we use Tranquility 0.4.2 and Druid 0.7.3. We have a ZooKeeper cluster composed of three nodes. The problem is that every time we kill one of the ZK nodes, the Overlord kills all running realtime tasks. It seems that the Overlord detects that the ZooKeeper node is down and reconnects to another ZK node, but then fails to retrieve the data of the running tasks. The middle manager tells the Overlord that there are realtime tasks running, but the Overlord doesn't know about them, so it sends a shutdown message to all of them. I extracted some important lines from the Overlord log; see the attachment for the full log, please.

2016-02-11T08:30:42,528 INFO [LeaderSelector-0] io.druid.indexing.overlord.TaskLockbox - Synced 0 locks for 0 tasks from storage (0 locks ignored).

2016-02-11T08:30:42,613 INFO [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[stats-druidmidman1:8091] wrote RUNNING status for task: index_realtime_sdgadserver-videoimpression_2016-02-11T00:00:00.000Z_0_0

2016-02-11T08:30:42,615 WARN [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[stats-druidmidman1:8091] announced a status for a task I didn't know about, adding to runningTasks: index_realtime_sdgadserver-videoimpression_2016-02-11T00:00:00.000Z_0_0

2016-02-11T08:30:42,806 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: stats-druidmidman1:8091, status 200 OK, response: {"task":"index_realtime_sdgadserver-videoimpression_2016-02-11T00:00:00.000Z_0_0"}

2016-02-11T08:30:42,807 ERROR [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Shutdown failed for index_realtime_sdgadserver-videoimpression_2016-02-11T00:00:00.000Z_0_0! Are you sure the task was running?

This repeats for every realtime task. Can you help me find the problem, please? I don't see any suspicious error messages in the other logs. I looked at /tranquility/beams/druid:prod:overlord on every ZK node using the ZK console and saw exactly the same data, so I don't know why the Overlord would retrieve information about zero tasks.

Lukáš

overlord.txt (17.4 KB)

Hi Lukas, if you update to 0.8.3, tasks should be restartable.

Hit "send" too fast. This is a known issue in older versions of Druid and was greatly improved in the latest Druid stable release, which can restore task state after a ZK failure or a middle manager shutdown.
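For anyone reading later: in recent Druid versions, task restoration on the middle manager is controlled by a property along these lines (a sketch based on the Druid configuration docs; check the docs for your exact version, since defaults and availability vary):

```properties
# middleManager runtime.properties
# Allow tasks to be gracefully stopped and restored after a restart
# (off by default in the versions that support it).
druid.indexer.task.restoreTasksOnRestart=true
```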

Hi Fangjin,
that's great, finally we have a compelling reason to update Druid to the newest version! :slight_smile: Thank you.

Lukáš

Hi Fangjin,
I've updated Druid to the newest stable version, 0.8.3, but I think I still see a similar error. We have a test cluster with one Overlord and one running indexing task. Whenever I restart the Overlord, it shuts down the running task. After the restart, the Overlord is unable to load data about running tasks from ZooKeeper (it always syncs zero tasks from storage, although I can see the data in the ZK console) and therefore shuts them down. Is this expected behavior? See the log in the attachment, please. The tasks are restored fine when I restart the middle manager, but the Overlord outage is still a problem.

overlord.log (2.46 KB)

Hey Lukáš,

Have you set druid.indexer.storage.type=metadata? The behavior you’re seeing sounds like the overlord is using in-heap task storage, which would cause it to forget about everything whenever it restarts.
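For reference, a minimal sketch of the Overlord setting in question (the metadata store connector shown here is illustrative; use whatever connector your cluster is already configured with):

```properties
# Overlord runtime.properties
# Persist task state in the metadata store instead of the default
# in-heap store, so it survives Overlord restarts.
druid.indexer.storage.type=metadata

# The metadata storage connector must also be configured, e.g.:
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://localhost:3306/druid
```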

Hi Gian, yeah that was it! I’ve missed that configuration. Thank you!