Recovering from a network partition

Hi, we are doing some failure testing on our development Druid cluster by inducing long network partitions (almost an hour) and have observed that several Druid services have trouble reconnecting to their peers. For example, on a host running a middle manager, once we reconnect the node after roughly an hour of disconnection, the middle manager never reappears in the overlord console. The process itself is still up and running and not logging any errors. Does anyone have insight into what might be going on? Thanks!
–T

Hey TJ,

Which version of Druid are you running?

Also, I wonder where exactly things are getting desynced. Do you see the MM in question announcing itself in /druid/indexer/announcements? And if you restart the overlord (without restarting the MM), does it show back up?
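
If it helps, something like this against your ZK ensemble will dump what's currently announced under that path. It's just a minimal sketch using the Curator client that Druid already bundles; the connect string is a placeholder, and zkCli.sh "ls /druid/indexer/announcements" works just as well.

import java.util.List;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CheckAnnouncements {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string; use the same ZK ensemble your Druid nodes point at.
        String zkConnect = "zk1:2181,zk2:2181,zk3:2181";
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                zkConnect, new ExponentialBackoffRetry(1000, 3));
        client.start();
        client.blockUntilConnected();
        // Each child of this path is a worker (middle manager) currently announcing itself to the overlord.
        List<String> workers = client.getChildren().forPath("/druid/indexer/announcements");
        System.out.println("Announced workers: " + workers);
        client.close();
    }
}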

We are running Druid 0.8.1. When we induce the network partition, the MM immediately drops out of the Overlord's list of active Middle Managers, and it stops announcing in /druid/indexer/announcements. When we remove the partition, the MM still does not reappear in /druid/indexer/announcements. Interestingly, the realtime ingestion peons that were running on that host keep running. About two hours after the partition is removed, the logs show the peons finally exiting, along with some errors. At that point the MM does successfully reconnect to the rest of the cluster, announces itself in /druid/indexer/announcements, and shows up in the Overlord console again. Here are the MM's logs after we removed the partition:

2016-05-12T18:37:12,429 INFO [HttpPostEmitter-1-0] com.metamx.http.client.pool.ChannelResourceFactory - Generating: http://dwh02.dev.skyportsystems.com:8888

2016-05-12T18:53:12,540 INFO [HttpPostEmitter-1-0] com.metamx.http.client.pool.ChannelResourceFactory - Generating: http://dwh02.dev.skyportsystems.com:8888

159119.639: [GC159119.639: [ParNew: 18536K->725K(19648K), 0.0021630 secs] 32107K->14296K(63360K), 0.0023100 secs] [Times: user=0.03 sys=0.00, real=0.00 secs]

2016-05-12T20:27:31,141 INFO [pool-9-thread-3] io.druid.indexing.overlord.ForkingTaskRunner - Process exited with status[0] for task: index_realtime_observations_2016-05-12T18:00:00.000Z_0_0

2016-05-12T20:27:31,141 INFO [pool-9-thread-3] io.druid.storage.hdfs.tasklog.HdfsTaskLogs - Writing task log to: hdfs:/tmp/druid/indexer/logs/index_realtime_observations_2016-05-12T18_00_00.000Z_0_0

2016-05-12T20:27:31,283 INFO [pool-9-thread-3] io.druid.storage.hdfs.tasklog.HdfsTaskLogs - Wrote task log to: hdfs:/tmp/druid/indexer/logs/index_realtime_observations_2016-05-12T18_00_00.000Z_0_0

2016-05-12T20:27:31,284 INFO [pool-9-thread-3] io.druid.indexing.overlord.ForkingTaskRunner - Removing temporary directory: /tmp/persistent/task/index_realtime_observations_2016-05-12T18:00:00.000Z_0_0/17419c42-f9ef-4618-a0e9-8a1a8732e509

2016-05-12T20:27:46,288 WARN [WorkerTaskMonitor-0] org.apache.curator.ConnectionState - Connection attempt unsuccessful after 8510795 (greater than max timeout of 30000). Resetting connection and trying again with a new connection.

2016-05-12T20:27:46,290 INFO [WorkerTaskMonitor-0] org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=dwh04.dev.skyportsystems.com,dwh02.dev.skyportsystems.com,dwh03.dev.skyportsystems.com sessionTimeout=30000 watcher=org.apache.curator.ConnectionState@b261941

2016-05-12T20:27:46,294 INFO [WorkerTaskMonitor-0-SendThread(dwh02.dev.skyportsystems.com:2181)] org.apache.zookeeper.ClientCnxn - Opening socket connection to server dwh02.dev.skyportsystems.com/172.16.50.200:2181. Will not attempt to authenticate using SASL (unknown error)

2016-05-12T20:27:46,296 INFO [WorkerTaskMonitor-0-SendThread(dwh02.dev.skyportsystems.com:2181)] org.apache.zookeeper.ClientCnxn - Socket connection established to dwh02.dev.skyportsystems.com/172.16.50.200:2181, initiating session

2016-05-12T20:27:46,338 INFO [WorkerTaskMonitor-0-SendThread(dwh02.dev.skyportsystems.com:2181)] org.apache.zookeeper.ClientCnxn - Session establishment complete on server dwh02.dev.skyportsystems.com/172.16.50.200:2181, sessionid = 0x354180afd58e007, negotiated timeout = 30000

2016-05-12T20:27:46,379 INFO [WorkerTaskMonitor-0] org.apache.zookeeper.ZooKeeper - Session: 0x354180afd58e007 closed

2016-05-12T20:27:46,379 INFO [WorkerTaskMonitor-0-EventThread] org.apache.zookeeper.ClientCnxn - EventThread shut down

2016-05-12T20:27:46,379 INFO [WorkerTaskMonitor-0] org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=dwh04.dev.skyportsystems.com,dwh02.dev.skyportsystems.com,dwh03.dev.skyportsystems.com sessionTimeout=30000 watcher=org.apache.curator.ConnectionState@b261941

2016-05-12T20:27:46,382 INFO [WorkerTaskMonitor-0-SendThread(dwh04.dev.skyportsystems.com:2181)] org.apache.zookeeper.ClientCnxn - Opening socket connection to server dwh04.dev.skyportsystems.com/172.16.50.202:2181. Will not attempt to authenticate using SASL (unknown error)

2016-05-12T20:27:46,383 INFO [WorkerTaskMonitor-0-SendThread(dwh04.dev.skyportsystems.com:2181)] org.apache.zookeeper.ClientCnxn - Socket connection established to dwh04.dev.skyportsystems.com/172.16.50.202:2181, initiating session

2016-05-12T20:27:46,399 INFO [WorkerTaskMonitor-0-SendThread(dwh04.dev.skyportsystems.com:2181)] org.apache.zookeeper.ClientCnxn - Session establishment complete on server dwh04.dev.skyportsystems.com/172.16.50.202:2181, sessionid = 0x15483054eb63339, negotiated timeout = 30000

2016-05-12T20:27:46,399 INFO [WorkerTaskMonitor-0-EventThread] org.apache.curator.framework.state.ConnectionStateManager - State change: RECONNECTED

2016-05-12T20:27:46,401 INFO [WorkerTaskMonitor-0-EventThread] io.druid.curator.announcement.Announcer - Node[/druid/indexer/announcements/dwh01.dev.skyportsystems.com:8080] dropped, reinstating.

2016-05-12T20:27:46,478 ERROR [WorkerTaskMonitor-0] io.druid.indexing.worker.WorkerTaskMonitor - Failed to update task status: {class=io.druid.indexing.worker.WorkerTaskMonitor, exceptionType=class java.lang.RuntimeException, exceptionMessage=org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /druid/indexer/status/dwh01.dev.skyportsystems.com:8080/index_realtime_observations_2016-05-12T18:00:00.000Z_0_0, task=index_realtime_observations_2016-05-12T18:00:00.000Z_0_0}

java.lang.RuntimeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /druid/indexer/status/dwh01.dev.skyportsystems.com:8080/index_realtime_observations_2016-05-12T18:00:00.000Z_0_0

at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.1.jar:?]

at io.druid.indexing.worker.WorkerCuratorCoordinator.announceTaskAnnouncement(WorkerCuratorCoordinator.java:219) ~[druid-indexing-service-0.8.1.jar:0.8.1]

at io.druid.indexing.worker.WorkerCuratorCoordinator.updateAnnouncement(WorkerCuratorCoordinator.java:233) ~[druid-indexing-service-0.8.1.jar:0.8.1]

at io.druid.indexing.worker.WorkerTaskMonitor$1$1.run(WorkerTaskMonitor.java:159) [druid-indexing-service-0.8.1.jar:0.8.1]

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [?:1.7.0_67]

at java.util.concurrent.FutureTask.run(FutureTask.java:262) [?:1.7.0_67]

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_67]

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_67]

at java.lang.Thread.run(Thread.java:745) [?:1.7.0_67]

Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /druid/indexer/status/dwh01.dev.skyportsystems.com:8080/index_realtime_observations_2016-05-12T18:00:00.000Z_0_0

at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) ~[zookeeper-3.4.6.jar:3.4.6-1569965]

at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[zookeeper-3.4.6.jar:3.4.6-1569965]

at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783) ~[zookeeper-3.4.6.jar:3.4.6-1569965]

at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:696) ~[curator-framework-2.8.0.jar:?]

at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:679) ~[curator-framework-2.8.0.jar:?]

at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) ~[curator-client-2.8.0.jar:?]

at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:676) ~[curator-framework-2.8.0.jar:?]

at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:453) ~[curator-framework-2.8.0.jar:?]

at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:443) ~[curator-framework-2.8.0.jar:?]

at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:44) ~[curator-framework-2.8.0.jar:?]

at io.druid.indexing.worker.WorkerCuratorCoordinator.announceTaskAnnouncement(WorkerCuratorCoordinator.java:212) ~[druid-indexing-service-0.8.1.jar:0.8.1]

… 7 more

2016-05-12T20:34:47,369 INFO [pool-9-thread-2] io.druid.indexing.overlord.ForkingTaskRunner - Process exited with status[0] for task: index_realtime_delta_netflow_2016-05-12T18:00:00.000Z_0_0

2016-05-12T20:34:47,370 INFO [pool-9-thread-2] io.druid.storage.hdfs.tasklog.HdfsTaskLogs - Writing task log to: hdfs:/tmp/druid/indexer/logs/index_realtime_delta_netflow_2016-05-12T18_00_00.000Z_0_0

2016-05-12T20:34:47,730 INFO [pool-9-thread-2] io.druid.storage.hdfs.tasklog.HdfsTaskLogs - Wrote task log to: hdfs:/tmp/druid/indexer/logs/index_realtime_delta_netflow_2016-05-12T18_00_00.000Z_0_0

2016-05-12T20:34:47,731 INFO [pool-9-thread-2] io.druid.indexing.overlord.ForkingTaskRunner - Removing temporary directory: /tmp/persistent/task/index_realtime_delta_netflow_2016-05-12T18:00:00.000Z_0_0/616a5b7b-4ce5-4724-bff1-ba9febaac610

2016-05-12T20:34:47,774 INFO [WorkerTaskMonitor-1] io.druid.indexing.worker.WorkerTaskMonitor - Job’s finished. Completed [index_realtime_delta_netflow_2016-05-12T18:00:00.000Z_0_0] with status [SUCCESS]

2016-05-12T20:36:01,498 INFO [pool-9-thread-4] io.druid.indexing.overlord.ForkingTaskRunner - Process exited with status[0] for task: index_realtime_insights_2016-05-12T18:00:00.000Z_0_0

2016-05-12T20:36:01,499 INFO [pool-9-thread-4] io.druid.storage.hdfs.tasklog.HdfsTaskLogs - Writing task log to: hdfs:/tmp/druid/indexer/logs/index_realtime_insights_2016-05-12T18_00_00.000Z_0_0

2016-05-12T20:36:01,848 INFO [pool-9-thread-4] io.druid.storage.hdfs.tasklog.HdfsTaskLogs - Wrote task log to: hdfs:/tmp/druid/indexer/logs/index_realtime_insights_2016-05-12T18_00_00.000Z_0_0

2016-05-12T20:36:01,849 INFO [pool-9-thread-4] io.druid.indexing.overlord.ForkingTaskRunner - Removing temporary directory: /tmp/persistent/task/index_realtime_insights_2016-05-12T18:00:00.000Z_0_0/afe0d4dd-2d9a-4d74-91a3-f4cda085fbf8

2016-05-12T20:36:01,862 INFO [WorkerTaskMonitor-3] io.druid.indexing.worker.WorkerTaskMonitor - Job’s finished. Completed [index_realtime_insights_2016-05-12T18:00:00.000Z_0_0] with status [SUCCESS]

164879.062: [GC164879.062: [ParNew: 18197K->624K(19648K), 0.0038250 secs] 31768K->14196K(63360K), 0.0041280 secs] [Times: user=0.05 sys=0.00, real=0.00 secs]
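
(For reference, the same worker list can be pulled straight from the Overlord rather than eyeballing the console; the sketch below assumes the overlord's /druid/indexer/v1/workers endpoint and uses a placeholder host/port.)

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ListOverlordWorkers {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port; point this at your overlord.
        URL url = new URL("http://overlord.example.com:8090/druid/indexer/v1/workers");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                // JSON array describing every worker the overlord currently knows about.
                System.out.println(line);
            }
        } finally {
            conn.disconnect();
        }
    }
}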

A bit more info: after rejoining the cluster, the previously partitioned Middle Manager still wasn't quite healthy, so we ultimately rebooted it. It eventually picked up some new realtime ingestion tasks, which succeeded, but the following hour's ingestion tasks never fully started. The Overlord has a bunch of error logs for the failed task:

2016-05-12T22:55:02,615 INFO [qtp889812032-183] io.druid.indexing.overlord.MetadataTaskStorage - Inserting task index_realtime_observations_2016-05-12T23:00:00.000Z_0_1 with status: TaskStatus{id=index_realtime_observations_2016-05-12T23:00:00.000Z_0_1, status=RUNNING, duration=-1}

2016-05-12T22:55:02,693 INFO [TaskQueue-Manager] io.druid.indexing.overlord.TaskQueue - Asking taskRunner to run: index_realtime_observations_2016-05-12T23:00:00.000Z_0_1

2016-05-12T22:55:02,693 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Added pending task index_realtime_observations_2016-05-12T23:00:00.000Z_0_1

2016-05-12T22:55:02,789 INFO [pool-23-thread-1] io.druid.indexing.overlord.RemoteTaskRunner - Coordinator asking Worker[dwh01.dev.skyportsystems.com:8080] to add task[index_realtime_observations_2016-05-12T23:00:00.000Z_0_1]

2016-05-12T22:55:02,824 INFO [pool-23-thread-1] io.druid.indexing.overlord.RemoteTaskRunner - Task index_realtime_observations_2016-05-12T23:00:00.000Z_0_1 switched from pending to running (on [dwh01.dev.skyportsystems.com:8080])

2016-05-12T23:00:02,825 ERROR [pool-23-thread-1] io.druid.indexing.overlord.RemoteTaskRunner - Something went wrong! [dwh01.dev.skyportsystems.com:8080] never ran task [index_realtime_observations_2016-05-12T23:00:00.000Z_0_1]! Timeout: (300000 >= PT5M)!

2016-05-12T23:00:02,825 INFO [pool-23-thread-1] io.druid.indexing.overlord.RemoteTaskRunner - Worker[dwh01.dev.skyportsystems.com:8080] completed task[index_realtime_observations_2016-05-12T23:00:00.000Z_0_1] with status[FAILED]

2016-05-12T23:00:02,826 INFO [pool-23-thread-1] io.druid.indexing.overlord.TaskQueue - Received FAILED status for task: index_realtime_observations_2016-05-12T23:00:00.000Z_0_1

2016-05-12T23:00:02,827 ERROR [pool-23-thread-1] io.druid.indexing.overlord.RemoteTaskRunner - WTF?! Asked to cleanup nonexistent task: {class=io.druid.indexing.overlord.RemoteTaskRunner, taskId=index_realtime_observations_2016-05-12T23:00:00.000Z_0_1}

2016-05-12T23:00:02,830 INFO [pool-23-thread-1] io.druid.indexing.overlord.MetadataTaskStorage - Updating task index_realtime_observations_2016-05-12T23:00:00.000Z_0_1 to status: TaskStatus{id=index_realtime_observations_2016-05-12T23:00:00.000Z_0_1, status=FAILED, duration=-1}

2016-05-12T23:00:02,984 INFO [pool-23-thread-1] io.druid.indexing.overlord.TaskQueue - Task done: RealtimeIndexTask{id=index_realtime_observations_2016-05-12T23:00:00.000Z_0_1, type=index_realtime, dataSource=observations}

2016-05-12T23:00:03,006 INFO [pool-23-thread-1] io.druid.indexing.overlord.TaskQueue - Task FAILED: RealtimeIndexTask{id=index_realtime_observations_2016-05-12T23:00:00.000Z_0_1, type=index_realtime, dataSource=observations} (-1 run duration)

2016-05-12T23:15:13,676 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.RemoteTaskRunner - Worker[dwh01.dev.skyportsystems.com:8080] wrote RUNNING status for task: index_realtime_observations_2016-05-12T23:00:00.000Z_0_1

2016-05-12T23:15:13,676 WARN [Curator-PathChildrenCache-1] io.druid.indexing.overlord.RemoteTaskRunner - Worker[dwh01.dev.skyportsystems.com:8080] announced a status for a task I didn’t know about, adding to runningTasks: index_realtime_observations_2016-05-12T23:00:00.000Z_0_1

2016-05-12T23:15:22,510 INFO [qtp889812032-183] io.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_realtime_observations_2016-05-12T23:00:00.000Z_0_1]: LockListAction{}

2016-05-12T23:15:25,973 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: dwh01.dev.skyportsystems.com:8080, status 200 OK, response: {"task":"index_realtime_observations_2016-05-12T23:00:00.000Z_0_1"}

2016-05-12T23:15:25,973 ERROR [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Shutdown failed for index_realtime_observations_2016-05-12T23:00:00.000Z_0_1! Are you sure the task was running?

2016-05-12T23:15:26,239 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.RemoteTaskRunner - Worker[dwh01.dev.skyportsystems.com:8080] wrote FAILED status for task: index_realtime_observations_2016-05-12T23:00:00.000Z_0_1

2016-05-12T23:15:26,239 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.RemoteTaskRunner - Worker[dwh01.dev.skyportsystems.com:8080] completed task[index_realtime_observations_2016-05-12T23:00:00.000Z_0_1] with status[FAILED]

2016-05-12T23:16:12,532 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Cleaning up task[index_realtime_observations_2016-05-12T23:00:00.000Z_0_1] on worker[dwh01.dev.skyportsystems.com:8080]

2016-05-12T23:16:12,598 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.RemoteTaskRunner - Task[index_realtime_observations_2016-05-12T23:00:00.000Z_0_1] went bye bye.

The Middle Manager's logs show that it received the task promptly, but then apparently sat on it for about 20 minutes before actually launching the peon:

middleManager.stdout-20160513:2016-05-12T22:55:02,831 INFO [TaskMonitorCache-0] io.druid.indexing.worker.WorkerTaskMonitor - Submitting runnable for task[index_realtime_observations_2016-05-12T23:00:00.000Z_0_1]

middleManager.stdout-20160513:2016-05-12T23:15:13,603 INFO [WorkerTaskMonitor-1] io.druid.indexing.worker.WorkerTaskMonitor - Affirmative. Running task [index_realtime_observations_2016-05-12T23:00:00.000Z_0_1]

middleManager.stdout-20160513:2016-05-12T23:15:13,676 INFO [pool-9-thread-2] io.druid.indexing.overlord.ForkingTaskRunner - Running command: …

middleManager.stdout-20160513:2016-05-12T23:15:13,681 INFO [pool-9-thread-2] io.druid.indexing.overlord.ForkingTaskRunner - Logging task index_realtime_observations_2016-05-12T23:00:00.000Z_0_1 output to: /tmp/persistent/task/index_realtime_observations_2016-05-12T23:00:00.000Z_0_1/19eb3ef2-5cd0-4e94-9c54-440cf6861051/log

middleManager.stdout-20160513:2016-05-12T23:15:25,949 INFO [qtp2001782588-103] io.druid.indexing.overlord.ForkingTaskRunner - Killing process for task: index_realtime_observations_2016-05-12T23:00:00.000Z_0_1

middleManager.stdout-20160513:2016-05-12T23:15:26,216 INFO [pool-9-thread-2] io.druid.indexing.overlord.ForkingTaskRunner - Removing temporary directory: /tmp/persistent/task/index_realtime_observations_2016-05-12T23:00:00.000Z_0_1/19eb3ef2-5cd0-4e94-9c54-440cf6861051

middleManager.stdout-20160513:2016-05-12T23:15:26,218 ERROR [WorkerTaskMonitor-1] io.druid.indexing.worker.WorkerTaskMonitor - I can’t build there. Failed to run task: {class=io.druid.indexing.worker.WorkerTaskMonitor, exceptionType=class java.util.concurrent.ExecutionException, exceptionMessage=java.lang.RuntimeException: java.io.IOException: Stream closed, task=index_realtime_observations_2016-05-12T23:00:00.000Z_0_1}

middleManager.stdout-20160513:2016-05-12T23:15:26,236 INFO [WorkerTaskMonitor-1] io.druid.indexing.worker.WorkerTaskMonitor - Job’s finished. Completed [index_realtime_observations_2016-05-12T23:00:00.000Z_0_1] with status [FAILED]

Hey TJ,

Sorry I haven't gotten back to you in a while.

I think you could be seeing this bug: https://github.com/druid-io/druid/pull/1868. It was fixed in 0.8.2.
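
One side note, separate from that bug: the "never ran task ... Timeout: (300000 >= PT5M)" error in your overlord log is the RemoteTaskRunner giving up after its task assignment timeout. If the worker is just slow to pick tasks up after reconnecting, that window can be widened in the overlord's runtime.properties; if I'm remembering the property name right, it's something like:

# Default is PT5M; raising it gives a freshly reconnected worker more time to start the peon.
druid.indexer.runner.taskAssignmentTimeout=PT15M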

OK, cool. We'll be updating to Druid 0.9.0 soon, so this should go away. Thanks!

–T