Druid (0.9.1) failing on single zookeeper failure?

Hey all,

We currently have a cluster of 5 zookeeper servers running, with druid set up to connect to all 5,

E.g. druid.zk.service.host=zk1,zk2,…zk5

Earlier today an instance of zk (ZK1) in our prod environment went down and took druid ingestion down with it…

All requests to the overlord ended up returning 500 responses with errors that zk1 was unreachable (though zk2-5 were functional). We have about 7 other systems connected to that zookeeper cluster and none of them were affected by the single zk node failure.

Is this expected behavior, or do we have something misconfigured with druid?

Thanks in advance.

Hi, please make sure that the connection string format is correct:

https://zookeeper.apache.org/doc/r3.2.2/api/org/apache/zookeeper/ZooKeeper.html#ZooKeeper(java.lang.String,%20int,%20org.apache.zookeeper.Watcher)

e.g. “127.0.0.1:2181,127.0.0.1:2182,127.0.0.1:3002”

Please note that parsing the connection string is done outside of druid.

I am not sure what exactly is happening within the curator framework used by druid, but there is some magical stuff happening when parsing that string.

For instance, a connection string like <fakehost.com,localhost> will work fine (you can try this on your local machine by just running a local zookeeper).

BUT the connection string <fakehost,localhost> will not work without the [.COM]! My guess is that there is some regex mismatch when splitting the string.

In conclusion, I guess you should be fine if you use IP:PORT,IP:PORT.
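
If it helps, here is a rough sketch of a sanity check you could run on the value before putting it in druid.zk.service.host. It is not what druid or curator do internally; the function name parse_zk_connect_string is made up for illustration, and it assumes plain host:port entries (no IPv6 literals), optionally followed by a chroot path after the last host.

```python
def parse_zk_connect_string(connect_string):
    """Split a comma-separated connection string into (host, port) pairs.

    Hypothetical helper for illustration only; raises ValueError if any
    entry lacks an explicit numeric port.
    """
    # An optional chroot suffix such as "/druid" may follow the last host.
    hosts_part, _, _chroot = connect_string.partition("/")
    servers = []
    for entry in hosts_part.split(","):
        entry = entry.strip()
        # Split on the last colon so the port is whatever follows it.
        host, sep, port = entry.rpartition(":")
        if not sep or not host or not port.isdecimal():
            raise ValueError(f"entry {entry!r} is not in host:port form")
        servers.append((host, int(port)))
    return servers


if __name__ == "__main__":
    # The example string from above parses cleanly.
    print(parse_zk_connect_string("127.0.0.1:2181,127.0.0.1:2182,127.0.0.1:3002"))
    # A string without ports is rejected by this check.
    try:
        parse_zk_connect_string("fakehost,localhost")
    except ValueError as err:
        print("rejected:", err)
```

A string that passes this check is at least in the explicit host:port form recommended above; it says nothing about whether the hosts resolve or are reachable.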