Router is failing to start due to failure to connect to coordinator

We are trying to install Druid in cluster mode on 3 machines, one each for master, query and data, with a shared 3-node ZooKeeper cluster.

The master and data processes have started without any issue, but the query services are failing to start. The Broker process has come up and no error is found in its log. The Router process logs the following error trace:

2022-12-06T12:09:23,908 WARN [CoordinatorRuleManager-Exec--0] org.apache.druid.discovery.DruidLeaderClient - Request[http://localhost:8081/druid/coordinator/v1/rules] failed.
org.jboss.netty.channel.ChannelException: Faulty channel in resource pool
        at org.apache.druid.java.util.http.client.NettyHttpClient.go(NettyHttpClient.java:134) ~[druid-core-24.0.1.jar:24.0.1]
        at org.apache.druid.java.util.http.client.AbstractHttpClient.go(AbstractHttpClient.java:33) ~[druid-core-24.0.1.jar:24.0.1]
        at org.apache.druid.discovery.DruidLeaderClient.go(DruidLeaderClient.java:143) ~[druid-server-24.0.1.jar:24.0.1]
        at org.apache.druid.discovery.DruidLeaderClient.go(DruidLeaderClient.java:127) ~[druid-server-24.0.1.jar:24.0.1]
        at org.apache.druid.server.router.CoordinatorRuleManager.poll(CoordinatorRuleManager.java:137) ~[druid-services-24.0.1.jar:24.0.1]
        at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$1.call(ScheduledExecutors.java:55) ~[druid-core-24.0.1.jar:24.0.1]
        at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$1.call(ScheduledExecutors.java:51) ~[druid-core-24.0.1.jar:24.0.1]
        at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$2.run(ScheduledExecutors.java:97) ~[druid-core-24.0.1.jar:24.0.1]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_322]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_322]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_322]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_322]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_322]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_322]
        at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_322]
Caused by: java.net.ConnectException: Connection refused: localhost/127.0.0.1:8081
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_322]
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716) ~[?:1.8.0_322]
        at org.jboss.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:152) ~[netty-3.10.6.Final.jar:?]
        at org.jboss.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105) ~[netty-3.10.6.Final.jar:?]
        at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:79) ~[netty-3.10.6.Final.jar:?]
        at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337) ~[netty-3.10.6.Final.jar:?]
        at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42) ~[netty-3.10.6.Final.jar:?]
        at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) ~[netty-3.10.6.Final.jar:?]
        at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) ~[netty-3.10.6.Final.jar:?]
        ... 3 more
2022-12-06T12:09:23,912 WARN [HttpClient-Netty-Boss-0] org.jboss.netty.channel.SimpleChannelUpstreamHandler - EXCEPTION, please implement org.jboss.netty.handler.codec.http.HttpContentDecompressor.exceptionCaught() for proper handling.
java.net.ConnectException: Connection refused: localhost/127.0.0.1:8081
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_322]
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716) ~[?:1.8.0_322]
        at org.jboss.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:152) ~[netty-3.10.6.Final.jar:?]
        at org.jboss.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105) ~[netty-3.10.6.Final.jar:?]
        at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:79) ~[netty-3.10.6.Final.jar:?]
        at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337) ~[netty-3.10.6.Final.jar:?]
        at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42) ~[netty-3.10.6.Final.jar:?]
        at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) ~[netty-3.10.6.Final.jar:?]
        at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) ~[netty-3.10.6.Final.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_322]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_322]
        at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_322]

From the logs, we gather that the Router is trying to connect to a coordinator running locally (127.0.0.1), but the coordinator is running on the master machine. The same request returns a response if localhost is replaced with the master’s IP address:

curl -XGET http://<master-ip>:8081/druid/coordinator/v1/rules
{"_default":[{"tieredReplicants":{"_default_tier":2},"type":"loadForever"}]}

There is no trace of localhost anywhere in the query configuration directory <>/conf/druid/cluster/query/.
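A quick way to double-check a claim like this is to search the configuration tree for loopback references. The following is a minimal sketch, not something from the original post: the function name `find_localhost_refs` is hypothetical, and it simply greps for `localhost` or `127.0.0.1` while skipping commented-out lines.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Druid): print every non-comment line
# under a conf directory that mentions localhost or 127.0.0.1.
find_localhost_refs() {
  local conf_dir="$1"
  # -r: recurse, -H: always print the file name, -n: print line numbers.
  # The second grep drops lines whose content starts with '#'.
  grep -rHnE 'localhost|127\.0\.0\.1' "$conf_dir" \
    | grep -vE ':[0-9]+:[[:space:]]*#' || true
}
```

For example, `find_localhost_refs conf/druid/cluster/query` checks the query configuration, and running it against `conf/druid/cluster/_common` covers the shared properties as well. Empty output means no uncommented loopback reference was found in those files.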

Any pointers on what is being missed here are appreciated.

These are the conf entries in the file:

druid.service=druid/router
druid.plaintextPort=8888
druid.router.http.numConnections=50
druid.router.http.readTimeout=PT5M
druid.router.http.numMaxThreads=100
druid.server.http.numThreads=100
druid.router.defaultBrokerServiceName=druid/broker
druid.router.coordinatorServiceName=druid/coordinator
druid.router.managementProxy.enabled=true

Troubleshooting steps that we have tried:

  • Curl-ing the request to master manually. This gives a success response.
  • Adding druid.host=<master-ip> to the query configuration file.

Relates to Apache Druid 24.0.1

All the servers announce themselves through ZooKeeper, so this would seem to be a ZooKeeper config issue. Have you configured the ZooKeeper hosts in the common runtime properties?

Thanks @Sergio_Ferragut for the response.

The ZooKeeper properties are set correctly in the file conf/druid/cluster/_common/common.runtime.properties:

druid.zk.service.host=<zoo-node-1>:2181,<zoo-node-2>:2181,<zoo-node-3>:2181
druid.zk.paths.base=/druid

The same is set in both master and data nodes. So this is what is expected, right @Sergio_Ferragut ?

Have you run this setup earlier with the same zk in single-node mode? If so, you can try using zkCli to remove /druid (shut down the Druid services before doing this). Restarting the cluster will populate /druid again in zk.
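The suggestion above could be carried out roughly as follows. This is a sketch under assumptions that are not stated in the thread: `zkCli.sh` is on the PATH, the ZooKeeper release is 3.5 or newer (which provides `deleteall`; older releases use `rmr` instead), and `zoo-node-1:2181` is only a placeholder for one of your ZooKeeper nodes. The helper names `zk` and `reset_druid_znode` are hypothetical.

```shell
#!/usr/bin/env bash
# Assumptions: zkCli.sh is on PATH and ZooKeeper is 3.5+ (for `deleteall`).
# ZK_SERVER is a placeholder; point it at one of your ZooKeeper nodes.
ZK_SERVER="${ZK_SERVER:-zoo-node-1:2181}"
ZK_CLI="${ZK_CLI:-zkCli.sh}"

# Run a single zkCli command non-interactively against ZK_SERVER.
zk() {
  "$ZK_CLI" -server "$ZK_SERVER" "$@"
}

# Remove the Druid base znode (druid.zk.paths.base, /druid in this thread).
# Shut down ALL Druid services before calling this.
reset_druid_znode() {
  local base="${1:-/druid}"
  zk ls "$base"          # show what is currently registered under the base path
  zk deleteall "$base"   # recursively delete the base znode and its children
}
```

After stopping the Druid services on every machine, calling `reset_druid_znode /druid` and then starting the services again lets them re-register under a fresh /druid path.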

Many thanks, @Vijay_Narayanan1, and apologies for the delayed response.

This did the trick - the same zk cluster had previously been used for a single-node setup. Deleting the /druid path in zk solved the issue and saved my day! :slight_smile:
