Zookeeper issues with Druid 0.16-incubating

Hi,

We provisioned a new Druid cluster (2 historicals, 1 middlemanager, and 1 broker). The Azure Blob deep storage already holds around 100k segments. When we start the historical nodes, the ZooKeeper nodes eventually hit an OOM error (even with 512 MB assigned to each ZooKeeper process). The ZooKeeper znode count is above 100k and does not come down. We also see the following errors in the historical logs. Please help.

2019-11-11T16:05:02,035 INFO [main-EventThread] org.apache.curator.framework.state.ConnectionStateManager - State change: RECONNECTED
2019-11-11T16:05:02,036 INFO [ZkCoordinator] org.apache.druid.server.coordination.ZkCoordinator - Ignoring event[PathChildrenCacheEvent{type=CONNECTION_RECONNECTED, data=null}]
org.apache.zookeeper.ClientCnxn - Session 0x36e5b1ecd8c0000 for server, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Packet len5848381 is out of range!
    at org.apache.zookeeper.ClientCnxnSocket.readLength(ClientCnxnSocket.java:113) ~[zookeeper-3.4.14.jar:3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf]
    at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:79) ~[zookeeper-3.4.14.jar:3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf]
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366) ~[zookeeper-3.4.14.jar:3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf]
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141) [zookeeper-3.4.14.jar:3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf]
2019-11-11T16:05:02,171 INFO [main-EventThread] org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
2019-11-11T16:05:02,172 INFO [ZkCoordinator] org.apache.druid.server.coordination.ZkCoordinator - Ignoring event[PathChildrenCacheEvent{type=CONNECTION_SUSPENDED, data=null}]

Regards,

-Anand

Hi Anand,
As per https://github.com/ksprojects/zkcopy/issues/12, the suggestion is to try increasing jute.maxbuffer, e.g. -Djute.maxbuffer=536870912.

Thank you.

–siva

Thank you for the reply, Siva. However, 512 MB of maxbuffer (536870912 bytes) looks like a lot. It is generally suggested to keep it on the order of 5-10 MB. In any case, the connections from the historicals to ZooKeeper keep dropping frequently.

Regards,

-Anand

Hi Anand,
You don’t need to go all the way up to 512 MB. Can you try increasing it in small increments over a couple of iterations and see at what value the connectivity becomes stable?
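As a rough sketch of how to pick a first increment: the log above reports a packet of 5848381 bytes, so the buffer has to be larger than that. The snippet below rounds the observed size plus some headroom up to a whole number of MiB. The 50% headroom and the 1 MiB step are arbitrary assumptions, and the 4 MiB client default is my understanding of ZooKeeper 3.4.x, so please verify it against your version. The function name is made up for illustration.

```python
MIB = 1024 * 1024

# Assumption: a ZooKeeper 3.4.x client falls back to 4096 * 1024 bytes
# when jute.maxbuffer is not set. Verify this against your ZooKeeper version.
CLIENT_DEFAULT_MAX_BUFFER = 4096 * 1024

# Taken from "Packet len5848381 is out of range!" in the historical log.
OBSERVED_PACKET_LEN = 5848381


def suggest_max_buffer(observed_len, headroom_percent=50, step=MIB):
    """Return the smallest multiple of `step` that covers the observed
    packet length plus `headroom_percent` extra.

    Both defaults are arbitrary starting points, not recommendations.
    """
    needed = observed_len * (100 + headroom_percent) // 100
    steps = -(-needed // step)  # ceiling division with integers
    return steps * step


if __name__ == "__main__":
    size = suggest_max_buffer(OBSERVED_PACKET_LEN)
    print("observed packet exceeds assumed client default:",
          OBSERVED_PACKET_LEN >= CLIENT_DEFAULT_MAX_BUFFER)
    print(f"-Djute.maxbuffer={size}")  # 9437184 bytes, i.e. 9 MiB
```

If the "out of range" error comes back with a larger packet length, rerun with the new number. As far as I know, jute.maxbuffer should be set to the same value on the ZooKeeper servers and on the Druid processes that connect to them.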

Thanks,

–siva