org.apache.druid.java.util.common.IOE: No known server - issue when switching between coordinators

I have set up a Druid cluster (version 0.21) with 2 coordinators, 3 ZooKeeper nodes, 2 query servers, and 6 HH & MM nodes. All services run on individual nodes.

When I start the cluster with a single coordinator, all the services come up fine and there are no errors on the admin console. When I bring the other coordinator up and stop the previous coordinator, everything starts failing: I see the error messages below on the console and on the middle managers, and my services do not start up. I suspect that ZooKeeper is not doing proper leader election, or that something else is wrong. Also, when I bring the other coordinator up as well, things still do not start normally, and when I check the API to find out the leader, it shows both coordinators as leaders. Please help, I have been breaking my head over this for 3 weeks now.

ERROR [BasicAuthenticatorCacheManager-Exec--0] org.apache.druid.security.basic.authentication.db.cache.CoordinatorPollingBasicAuthenticatorCacheManager - Encountered exception while fetching user map for authenticator [ldap]: {class=org.apache.druid.security.basic.authentication.db.cache.CoordinatorPollingBasicAuthenticatorCacheManager, exceptionType=class org.apache.druid.java.util.common.IOE, exceptionMessage=No known server}
org.apache.druid.java.util.common.IOE: No known server
        at org.apache.druid.discovery.DruidLeaderClient.getCurrentKnownLeader(DruidLeaderClient.java:267) ~[druid-server-0.21.0.jar:0.21.0]
        at org.apache.druid.discovery.DruidLeaderClient.makeRequest(DruidLeaderClient.java:122) ~[druid-server-0.21.0.jar:0.21.0]
        at org.apache.druid.security.basic.authentication.db.cache.CoordinatorPollingBasicAuthenticatorCacheManager.tryFetchUserMapFromCoordinator(CoordinatorPollingBasicAuthenticatorCacheManager.java:252) ~[?:?]
        at org.apache.druid.security.basic.authentication.db.cache.CoordinatorPollingBasicAuthenticatorCacheManager.lambda$fetchUserMapFromCoordinator$1(CoordinatorPollingBasicAuthenticatorCacheManager.java:192) ~[?:?]
        at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:87) ~[druid-core-0.21.0.jar:0.21.0]
        at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:115) ~[druid-core-0.21.0.jar:0.21.0]
        at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:105) ~[druid-core-0.21.0.jar:0.21.0]
        at org.apache.druid.security.basic.authentication.db.cache.CoordinatorPollingBasicAuthenticatorCacheManager.fetchUserMapFromCoordinator(CoordinatorPollingBasicAuthenticatorCacheManager.java:190) ~[?:?]
        at org.apache.druid.security.basic.authentication.db.cache.CoordinatorPollingBasicAuthenticatorCacheManager.lambda$start$0(CoordinatorPollingBasicAuthenticatorCacheManager.java:122) ~[?:?]
        at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$1.call(ScheduledExecutors.java:55) [druid-core-0.21.0.jar:0.21.0]
        at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$1.call(ScheduledExecutors.java:51) [druid-core-0.21.0.jar:0.21.0]
        at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$2.run(ScheduledExecutors.java:97) [druid-core-0.21.0.jar:0.21.0]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_262]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_262]

Hey @vishalth!

AFAIK leader election is entirely under the control of ZooKeeper via latches, so I would suggest doing some digging into the ZooKeeper side of the cluster: confirm that both coordinators have the correct ZooKeeper entries, and that the ZooKeeper nodes can all talk to one another OK.
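
To see what each coordinator believes about leadership, something like the sketch below may help. It polls the coordinator `isLeader` endpoint (`/druid/coordinator/v1/isLeader`), which answers HTTP 200 on the leader and HTTP 404 on a non-leader. The host names in the usage comment are made up, `is_leader` and `diagnose` are just illustrative helper names, and with authentication enabled the request will probably need credentials, which this sketch does not send.

```python
import json
import urllib.error
import urllib.request


def is_leader(base_url, timeout=5.0):
    """Return True/False as reported by the coordinator, or None if unknown."""
    url = base_url.rstrip("/") + "/druid/coordinator/v1/isLeader"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return bool(json.load(resp).get("leader"))
    except urllib.error.HTTPError as err:
        # Druid answers 404 when the node is up but is not the leader.
        return False if err.code == 404 else None
    except (urllib.error.URLError, OSError, ValueError, AttributeError):
        # Unreachable host, timeout, or a body that is not the expected JSON.
        return None


def diagnose(flags):
    """Summarise a {coordinator_url: True/False/None} mapping."""
    leaders = sorted(url for url, flag in flags.items() if flag is True)
    if len(leaders) == 1:
        return "ok: single leader " + leaders[0]
    if len(leaders) > 1:
        return "split-brain: " + ", ".join(leaders)
    return "no leader visible"


# Usage (hypothetical host names, default coordinator port assumed):
# urls = ["http://coord1.example.internal:8081", "http://coord2.example.internal:8081"]
# print(diagnose({u: is_leader(u) for u in urls}))
```

If this reports more than one leader, the coordinators are most likely not sharing the same ZooKeeper ensemble or base path, which would point back at the ZooKeeper configuration.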

Because of the "No known server" entry in your log, I'd also check that all your node names are resolvable from all your servers.
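
A quick way to do the name check is to run a small script like this on every server. It is only a sketch: `resolve` and `unresolved` are illustrative names, and the host list in the usage comment is hypothetical, so replace it with the names your services actually announce.

```python
import socket


def resolve(host):
    """Return the sorted, de-duplicated addresses a host name resolves to."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


def unresolved(hosts):
    """Return the subset of hosts that cannot be resolved from this machine."""
    return [host for host in hosts if not resolve(host)]


# Usage (hypothetical names):
# print(unresolved(["coord1.example.internal", "coord2.example.internal", "zk1.example.internal"]))
```

An empty list on every server means name resolution is fine and the problem is more likely in ZooKeeper itself.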

Thanks, I was able to resolve the issue after reconfiguring ZooKeeper.