Middle Manager keeps losing connection with Zookeeper

Hi all,

My hourly HDFS ingestion task often stalls, holding up pending ingestion tasks. The task logs show that the Middle Manager keeps timing out with Zookeeper:

2017-06-20T15:34:07,928 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - map 100% reduce 100%

2017-06-20T15:55:28,462 WARN [main-SendThread(nj-db20.acuityads.org:2181)] org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 45574ms for sessionid 0x25ca71278750b21

2017-06-20T15:55:32,961 INFO [main-SendThread(nj-db20.acuityads.org:2181)] org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 45574ms for sessionid 0x25ca71278750b21, closing socket connection and attempting reconnect

2017-06-20T16:04:14,402 INFO [main-EventThread] org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED

2017-06-20T16:14:38,668 WARN [main-EventThread] org.apache.curator.ConnectionState - Connection attempt unsuccessful after 387978 (greater than max timeout of 120000). Resetting connection and trying again with a new connection.

2017-06-20T16:12:43,941 INFO [main-SendThread(nj-db21.acuityads.org:2181)] org.apache.zookeeper.ClientCnxn - Opening socket connection to server nj-db21.acuityads.org/10.65.17.21:2181. Will not attempt to authenticate using SASL (unknown error)

2017-06-20T16:29:21,951 INFO [main-SendThread(nj-db21.acuityads.org:2181)] org.apache.zookeeper.ClientCnxn - Socket connection established to nj-db21.acuityads.org/10.65.17.21:2181, initiating session

2017-06-20T16:32:28,211 WARN [main-SendThread(nj-db21.acuityads.org:2181)] org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 406879ms for sessionid 0x25ca71278750b21

2017-06-20T16:38:49,698 INFO [main-EventThread] org.apache.zookeeper.ZooKeeper - Session: 0x25ca71278750b21 closed

2017-06-20T16:52:53,336 INFO [main-EventThread] org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString:2181=nj-db20.acuityads.org,nj-db21.acuityads.org:2181,nj-db19.acuityads.org:2181 sessionTimeout=120000 watcher=org.apache.curator.ConnectionState@60dc1a4e

My Middle Manager runs on its own node with 24 cores and 380 GB of RAM, and I’ve allocated a 10 GB heap for its JVM. Is there a reason why the Middle Manager is unable to keep its connection with Zookeeper?
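
For anyone triaging a similar log, here is a minimal sketch that pulls the "have not heard from server" durations out of a task log and lists large gaps between consecutive log lines. It is my own helper, not part of Druid or ZooKeeper: the function names are hypothetical, and it assumes every line starts with the timestamp layout shown in the excerpt above. A long gap between lines may hint that the whole process was stalled rather than that ZooKeeper was unreachable, but that is an assumption to verify, not a diagnosis.

```python
import re
from datetime import datetime

# Timestamp layout used by the log lines quoted above, e.g. 2017-06-20T15:55:28,462
TS_FORMAT = "%Y-%m-%dT%H:%M:%S,%f"
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3}) ")
# The ClientCnxn message that reports how long the client heard nothing.
TIMEOUT_RE = re.compile(
    r"have not heard from server in (\d+)ms for sessionid (0x[0-9a-fA-F]+)"
)


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    m = TS_RE.match(line)
    return datetime.strptime(m.group(1), TS_FORMAT) if m else None


def session_timeouts(lines):
    """Collect (timestamp, silent_ms, session_id) for each timeout message."""
    found = []
    for line in lines:
        ts = parse_ts(line)
        m = TIMEOUT_RE.search(line)
        if ts is not None and m:
            found.append((ts, int(m.group(1)), m.group(2)))
    return found


def log_gaps(lines, min_gap_seconds=60.0):
    """Collect (previous_ts, ts, gap_seconds) for consecutive lines far apart.

    Lines without a timestamp are skipped; out-of-order lines produce a
    negative gap and are therefore never reported.
    """
    gaps = []
    prev = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue
        if prev is not None:
            gap = (ts - prev).total_seconds()
            if gap >= min_gap_seconds:
                gaps.append((prev, ts, gap))
        prev = ts
    return gaps
```

Both functions accept any iterable of strings, so an open file handle for the task log can be passed directly.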

Hi, Aaron.

By pending ingestion tasks, do you mean that you can see pending tasks in the Overlord console?

How did you set druid.worker.capacity in the middleManager runtime configuration, and how many tasks were running on each middleManager while tasks were pending?

Just to add to @GunWoo's reply: 10 GB of heap for the MiddleManager is overkill. The MiddleManager simply announces itself in ZK, spawns peons, and manages them.
Generally, the MM will work fine with even 512 MB of JVM heap.

Hi, what's your ZK version?