Curator 2.6.0 - Background operation retry gave up after ZooKeeper goes down

Hi Team,

We have been testing a Druid environment, but we get the error below on the Druid nodes whenever any one of the ZooKeeper nodes goes down, and it persists even after that node comes back.

org.apache.curator.framework.imps.CuratorFrameworkImpl - Background operation retry gave up

The cluster layout is:

Machine 1 - zookeeper, RT, Historical

Machine 2 - zookeeper, Historical

Machine 3 - zookeeper, Historical, Indexer, Coordinator

Machine 4 - kafka, Historical

Druid 0.6.146

zookeeper 3.4.6

I found a similar issue discussed in this topic:

https://groups.google.com/forum/#!searchin/druid-development/org.apache.curator.framework.imps.CuratorFrameworkImpl$20-$20Background$20retry$20gave$20up/druid-development/s--zH5cvjI0/72acl86a72UJ

At the end, the discussion talks about setting announcer.type to batch, but I suppose that is already the default in Druid 0.6.146.

Any help on this?

Hi Parampreet,

The default announcer in 0.6.146 does not use batch announcements. That only became the default in 0.7.0. I recall seeing the ZK issues you describe in older versions of Druid, but we have made multiple fixes for a variety of Druid/ZK related issues since 0.6.146. I would recommend updating to 0.7.1.1 and seeing whether you still have the same problems.
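
If upgrading right away is not an option, the announcer type mentioned above can be set explicitly. This is only a sketch: it assumes 0.6.146 reads the druid.announcer.type property from runtime.properties the same way later releases do, and that the value is set consistently on all nodes. Please check it against the configuration docs for your version before relying on it.

```properties
# runtime.properties (sketch; verify the property against your Druid version's docs)
# Use batch segment announcements instead of the legacy one-znode-per-segment style.
druid.announcer.type=batch
```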

– FJ

Hi Parampreet,

I recommend that you also go through:

http://zookeeper.apache.org/doc/r3.4.6/zookeeperAdmin.html#sc_designing

http://zookeeper.apache.org/doc/r3.4.6/zookeeperAdmin.html#Single+Machine+Requirements

In general, it's not a good idea to run ZooKeeper in production along with other services on the same machine, as that can lead to resource contention and significant performance degradation.

If you can move ZooKeeper to separate machines, that will also help.
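
As a rough illustration of what a dedicated three-node ensemble could look like, here is a minimal zoo.cfg sketch. The hostnames and the data directory are placeholders I made up, not values from this thread, and the timing values are just the common sample settings rather than tuned recommendations.

```properties
# zoo.cfg, identical on each of the three dedicated ZooKeeper machines
# (hostnames and dataDir below are placeholders, not taken from this cluster)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
# server.<id>=<host>:<peer port>:<leader election port>
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```

Each machine also needs a myid file in dataDir containing its own server id (1, 2 or 3). The Druid nodes would then point at all three hosts in druid.zk.service.host, for example zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181. A three-server ensemble keeps its quorum with one server down, which matches the failure case you are testing.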