Broker Service is stalled, complaining about zookeeper session/connection timeout

Relates to Apache Druid 0.22.1

We recently have a spike in our Druid Query submissions. When we have multiple concurrent queries running, we are seeing issues with our Broker service stalled.
Upon investigation, on the query nodes, we noticed the connection to zookeeper service is timing out (See below Logs from Query Nodes). At the same time, the coordinator service, is complaining
about a failed lookup managerment for the same query node. When we run “htop” command, on the query nodes, we are seeing a high cpu usage - while this issue persists.

On the zookeeper nodes, we are not seeing any errors in the log files. There is no significant spike in CPU or MEMORY usage.

Our current mitigation solution is, to force restart the query node services (broker and router). Any ideas, on what we can do to further troubleshoot this issue.
We prefer to fix the underlying issue rather than resort to Restarting the service, everytime this happens.

LOGS FROM QUERY NODE:

2022-03-03T16:31:06,977 INFO [main-EventThread] org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=druid-zk-1.something.com,druid-zk-2.something.com,druid-zk-3.something.com,druid-zk-4.something.com,druid-zk-5.something.com sessi
onTimeout=30000 watcher=org.apache.curator.ConnectionState@31082efa
2022-03-03T16:31:06,977 INFO [main-EventThread] org.apache.zookeeper.ClientCnxnSocket - jute.maxbuffer value is 4194304 Bytes
2022-03-03T16:31:06,977 INFO [main-EventThread] org.apache.zookeeper.ClientCnxn - zookeeper.request.timeout value is 0. feature enabled=
2022-03-03T16:31:06,977 INFO [main-EventThread] org.apache.curator.framework.state.ConnectionStateManager - State change: LOST
2022-03-03T16:31:06,978 WARN [NodeRoleWatcher[COORDINATOR]] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeRoleWatcher - Ignored event type[CONNECTION_LOST] for node watcher of role[coordinator].
2022-03-03T16:31:06,978 WARN [NodeRoleWatcher[OVERLORD]] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeRoleWatcher - Ignored event type[CONNECTION_LOST] for node watcher of role[overlord].
2022-03-03T16:31:06,978 INFO [Announcer-0] org.apache.druid.curator.announcement.Announcer - Node[/druid-prod-gamma/internal-discovery/BROKER/druid-prod-query-3206.something.com:8082] is added to reinstate.
2022-03-03T16:31:06,978 INFO [ZkCoordinator] org.apache.druid.server.coordination.ZkCoordinator - Ignoring event[PathChildrenCacheEvent{type=CONNECTION_LOST, data=null}]
2022-03-03T16:31:06,978 INFO [main-EventThread] org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x300b07090803709
2022-03-03T16:31:06,996 INFO [main-SendThread(druid-zk-4.something.com:2181)] org.apache.zookeeper.ClientCnxn - Opening socket connection to server druid-zk-4.something.com/10.220.8.120:2181. Will not attempt to authenticate using SASL (unknown error)
2022-03-03T16:31:06,996 INFO [main-SendThread(druid-zk-4.something.com:2181)] org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /10.220.33.129:56538, server: druid-zk-4.something.com/10.220.8.120:2181
2022-03-03T16:31:06,999 INFO [main-SendThread(druid-zk-4.something.com:2181)] org.apache.zookeeper.ClientCnxn - Session establishment complete on server druid-zk-4.something.com/10.220.8.120:2181, sessionid = 0x400b0708a9436c9, negotiated timeout = 30000
2022-03-03T16:31:06,999 INFO [main-EventThread] org.apache.curator.framework.state.ConnectionStateManager - State change: RECONNECTED
2022-03-03T16:31:07,000 WARN [NodeRoleWatcher[COORDINATOR]] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeRoleWatcher - Ignored event type[CONNECTION_RECONNECTED] for node watcher of role[coordinator].
2022-03-03T16:31:07,000 WARN [NodeRoleWatcher[OVERLORD]] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeRoleWatcher - Ignored event type[CONNECTION_RECONNECTED] for node watcher of role[overlord].
2022-03-03T16:31:07,001 INFO [Announcer-0] org.apache.druid.curator.announcement.Announcer - Node[/druid-prod-gamma/internal-discovery/BROKER/druid-prod-query-3206.something.com:8082] dropped, reinstating.
2022-03-03T16:31:07,003 INFO [Announcer-0] org.apache.druid.curator.announcement.Announcer - Reinstating [/druid-prod-gamma/internal-discovery/BROKER/druid-prod-query-3206.something.com:8082]
2022-03-03T16:31:07,003 INFO [ZkCoordinator] org.apache.druid.server.coordination.ZkCoordinator - Ignoring event[PathChildrenCacheEvent{type=CONNECTION_RECONNECTED, data=null}]
2022-03-03T16:31:07,004 INFO [main-EventThread] org.apache.curator.framework.imps.EnsembleTracker - New config event received: {server.1=druid-zk-1.something.com:2888:3888:participant, version=0, server.5=druid-zk-5.something.com:2888:3888:participant, server.4=ae-druid
-zk-1004.something.com:2888:3888:participant, server.3=druid-zk-3.something.com:2888:3888:participant, server.2=druid-zk-2.something.com:2888:3888:participant}
2022-03-03T16:31:07,018 ERROR [main-EventThread] org.apache.curator.framework.imps.EnsembleTracker - Invalid config event received: {server.1=druid-zk-1.something.com:2888:3888:participant, version=0, server.5=druid-zk-5.something.com:2888:3888:participant, server.4=ae-
druid-zk-4.something.com:2888:3888:participant, server.3=druid-zk-3.something.com:2888:3888:participant, server.2=druid-zk-2.something.com:2888:3888:participant}
2022-03-03T16:33:36,228 WARN [main-SendThread(druid-zk-4.something.com:2181)] org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 148909ms for sessionid 0x400b0708a9436c9
2022-03-03T16:33:36,228 INFO [main-SendThread(druid-zk-4.something.com:2181)] org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 148909ms for sessionid 0x400b0708a9436c9, closing socket connection and attempting reconnect
2022-03-03T16:33:36,332 INFO [main-EventThread] org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
2022-03-03T16:33:36,332 WARN [NodeRoleWatcher[COORDINATOR]] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeRoleWatcher - Ignored event type[CONNECTION_SUSPENDED] for node watcher of role[coordinator].
2022-03-03T16:33:36,332 WARN [NodeRoleWatcher[OVERLORD]] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeRoleWatcher - Ignored event type[CONNECTION_SUSPENDED] for node watcher of role[overlord].
2022-03-03T16:33:36,332 INFO [ZkCoordinator] org.apache.druid.server.coordination.ZkCoordinator - Ignoring event[PathChildrenCacheEvent{type=CONNECTION_SUSPENDED, data=null}]
2022-03-03T16:33:36,399 INFO [main-SendThread(druid-zk-3.something.com:2181)] org.apache.zookeeper.ClientCnxn - Opening socket connection to server druid-zk-3.something.com/10.220.10.80:2181. Will not attempt to authenticate using SASL (unknown error)
2022-03-03T16:33:36,400 INFO [main-SendThread(druid-zk-3.something.com:2181)] org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /10.220.33.129:53332, server: druid-zk-3.something.com/10.220.10.80:2181
2022-03-03T16:33:36,404 WARN [main-SendThread(druid-zk-3.something.com:2181)] org.apache.zookeeper.ClientCnxn - Unable to reconnect to ZooKeeper service, session 0x400b0708a9436c9 has expired
2022-03-03T16:33:36,404 INFO [main-SendThread(druid-zk-3.something.com:2181)] org.apache.zookeeper.ClientCnxn - Unable to reconnect to ZooKeeper service, session 0x400b0708a9436c9 has expired, closing socket connection
2022-03-03T16:33:36,404 WARN [main-EventThread] org.apache.curator.ConnectionState - Session expired event received

LOGS FROM COORDINATOR NODE:

2022-03-03T16:25:30,865 WARN [HttpClient-Netty-Worker-6] org.apache.druid.java.util.http.client.pool.ResourcePool - Resource at key[http://druid-prod-query-3203.bdp.roku.com:8082] was returned multiple times?
2022-03-03T16:25:30,865 ERROR [LookupCoordinatorManager--9] org.apache.druid.server.lookup.cache.LookupCoordinatorManager - Failed to finish lookup management on node [http:druid-prod-query-3203.bdp.roku.com:8082]: {class=org.apache.druid.server.lookup.cache.LookupCoordinator
Manager, exceptionType=class java.util.concurrent.ExecutionException, exceptionMessage=org.jboss.netty.handler.timeout.ReadTimeoutException}
java.util.concurrent.ExecutionException: org.jboss.netty.handler.timeout.ReadTimeoutException
        at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) ~[guava-16.0.1.jar:?]
        at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) ~[guava-16.0.1.jar:?]
        at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[guava-16.0.1.jar:?]
        at org.apache.druid.server.lookup.cache.LookupCoordinatorManager$LookupsCommunicator.getLookupStateForNode(LookupCoordinatorManager.java:855) ~[druid-server-0.22.1.jar:0.22.1]
        at org.apache.druid.server.lookup.cache.LookupCoordinatorManager.doLookupManagementOnNode(LookupCoordinatorManager.java:646) ~[druid-server-0.22.1.jar:0.22.1]
        at org.apache.druid.server.lookup.cache.LookupCoordinatorManager.lambda$lookupManagementLoop$2(LookupCoordinatorManager.java:590) ~[druid-server-0.22.1.jar:0.22.1]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_312]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_312]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_312]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_312]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_312]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_312]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_312]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_312]
Caused by: org.jboss.netty.handler.timeout.ReadTimeoutException
        at org.jboss.netty.handler.timeout.ReadTimeoutHandler.<clinit>(ReadTimeoutHandler.java:84) ~[netty-3.10.6.Final.jar:?]
        at org.apache.druid.java.util.http.client.NettyHttpClient.go(NettyHttpClient.java:172) ~[druid-core-0.22.1.jar:0.22.1]
        at org.apache.druid.server.lookup.cache.LookupCoordinatorManager$LookupsCommunicator.getLookupStateForNode(LookupCoordinatorManager.java:850) ~[druid-server-0.22.1.jar:0.22.1]
        ... 10 more
2022-03-03T16:31:33,164 WARN [HttpClient-Netty-Worker-35] org.apache.druid.java.util.http.client.pool.ResourcePool - Resource at key[http://druid-prod-query-3204.bdp.roku.com:8082] was returned multiple times?
2022-03-03T16:31:33,164 WARN [HttpClient-Netty-Worker-29] org.apache.druid.java.util.http.client.pool.ResourcePool - Resource at key[http://druid-prod-query-3206.bdp.roku.com:8082] was returned multiple times?
2022-03-03T16:31:33,164 ERROR [LookupCoordinatorManager--4] org.apache.druid.server.lookup.cache.LookupCoordinatorManager - Failed to finish lookup management on node [http:druid-prod-query-3206.bdp.roku.com:8082]: {class=org.apache.druid.server.lookup.cache.LookupCoordinator
Manager, exceptionType=class java.util.concurrent.ExecutionException, exceptionMessage=org.jboss.netty.handler.timeout.ReadTimeoutException}
java.util.concurrent.ExecutionException: org.jboss.netty.handler.timeout.ReadTimeoutException
        at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) ~[guava-16.0.1.jar:?]
        at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) ~[guava-16.0.1.jar:?]
        at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[guava-16.0.1.jar:?]
        at org.apache.druid.server.lookup.cache.LookupCoordinatorManager$LookupsCommunicator.getLookupStateForNode(LookupCoordinatorManager.java:855) ~[druid-server-0.22.1.jar:0.22.1]
        at org.apache.druid.server.lookup.cache.LookupCoordinatorManager.doLookupManagementOnNode(LookupCoordinatorManager.java:646) ~[druid-server-0.22.1.jar:0.22.1]
        at org.apache.druid.server.lookup.cache.LookupCoordinatorManager.lambda$lookupManagementLoop$2(LookupCoordinatorManager.java:590) ~[druid-server-0.22.1.jar:0.22.1]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_312]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_312]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_312]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_312]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_312]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_312]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_312]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_312]
Caused by: org.jboss.netty.handler.timeout.ReadTimeoutException
        at org.jboss.netty.handler.timeout.ReadTimeoutHandler.<clinit>(ReadTimeoutHandler.java:84) ~[netty-3.10.6.Final.jar:?]
        at org.apache.druid.java.util.http.client.NettyHttpClient.go(NettyHttpClient.java:172) ~[druid-core-0.22.1.jar:0.22.1]
        at org.apache.druid.server.lookup.cache.LookupCoordinatorManager$LookupsCommunicator.getLookupStateForNode(LookupCoordinatorManager.java:850) ~[druid-server-0.22.1.jar:0.22.1]
        ... 10 more

Hi krishnat. I might start by looking at druid.server.http.defaultQueryTimeout, druid.broker.http.numConnections and druid.server.http.numThreads.