We currently have a cluster of 5 ZooKeeper servers running, with Druid set up to connect to all 5.
Earlier today, one ZooKeeper instance (ZK1) in our prod environment went down and took Druid ingestion down with it…
All requests to the overlord returned 500 responses with errors saying that zk1 was unreachable (though zk2-5 were functional). We have about 7 other systems connected to that ZooKeeper cluster, and none of them were affected by the single ZK node failure.
Is this expected behavior, or do we have something misconfigured in Druid?
Thanks in advance.