Router crash, disconnection from brokers

Relates to Apache Druid 0.21.1

We’re periodically facing crashes of the Router and Brokers. Requests fail to execute and the Druid console is unavailable (the web app loads but cannot retrieve any information).

We see errors like this in the Router log:

2021-12-14T03:02:49,886 INFO [NodeRoleWatcher[BROKER]] org.apache.druid.discovery.BaseNodeRoleWatcher - Node[http://10.244.29.80:8082] of role[broker] detected.
2021-12-14T03:05:11,540 INFO [NodeRoleWatcher[BROKER]] org.apache.druid.discovery.BaseNodeRoleWatcher - Node[http://10.244.29.80:8082] of role[broker] went offline.
2021-12-14T03:06:21,909 INFO [NodeRoleWatcher[BROKER]] org.apache.druid.discovery.BaseNodeRoleWatcher - Node[http://10.244.29.80:8082] of role[broker] detected.
2021-12-14T03:08:01,542 INFO [NodeRoleWatcher[BROKER]] org.apache.druid.discovery.BaseNodeRoleWatcher - Node[http://10.244.29.80:8082] of role[broker] went offline.
2021-12-14T04:41:26,203 INFO [NodeRoleWatcher[BROKER]] org.apache.druid.discovery.BaseNodeRoleWatcher - Node[http://10.244.29.80:8082] of role[broker] detected.
2021-12-14T04:43:57,543 INFO [NodeRoleWatcher[BROKER]] org.apache.druid.discovery.BaseNodeRoleWatcher - Node[http://10.244.29.80:8082] of role[broker] went offline.
2021-12-14T04:54:15,879 INFO [NodeRoleWatcher[BROKER]] org.apache.druid.discovery.BaseNodeRoleWatcher - Node[http://10.244.29.80:8082] of role[broker] detected.
2021-12-14T04:58:13,543 INFO [NodeRoleWatcher[BROKER]] org.apache.druid.discovery.BaseNodeRoleWatcher - Node[http://10.244.29.80:8082] of role[broker] went offline.
2021-12-14T05:12:35,956 INFO [NodeRoleWatcher[BROKER]] org.apache.druid.discovery.BaseNodeRoleWatcher - Node[http://10.244.29.80:8082] of role[broker] detected.
2021-12-14T05:14:39,543 INFO [NodeRoleWatcher[BROKER]] org.apache.druid.discovery.BaseNodeRoleWatcher - Node[http://10.244.29.80:8082] of role[broker] went offline.
2021-12-14T05:14:45,304 INFO [NodeRoleWatcher[BROKER]] org.apache.druid.discovery.BaseNodeRoleWatcher - Node[http://10.244.29.80:8082] of role[broker] detected.
2021-12-14T05:18:07,543 INFO [NodeRoleWatcher[BROKER]] org.apache.druid.discovery.BaseNodeRoleWatcher - Node[http://10.244.29.80:8082] of role[broker] went offline.
2021-12-14T07:32:42,372 INFO [NodeRoleWatcher[BROKER]] org.apache.druid.discovery.BaseNodeRoleWatcher - Node[http://10.244.29.80:8082] of role[broker] detected.
2021-12-14T07:32:42,382 INFO [NodeRoleWatcher[BROKER]] org.apache.druid.discovery.BaseNodeRoleWatcher - Node[http://10.244.29.80:8082] of role[broker] went offline.
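A minimal Python sketch for quantifying this flapping: it pairs each `detected` line with the next `went offline` line for the same node and reports how long the broker stayed registered with the Router. It assumes the exact `BaseNodeRoleWatcher` line format shown above; passing the log path on the command line is only an illustrative convention, not part of Druid.

```python
import re
import sys
from datetime import datetime

# Matches Router BaseNodeRoleWatcher lines in the format shown in the excerpt above (assumed format).
LINE_RE = re.compile(
    r"^(?P<ts>\S+) INFO \[NodeRoleWatcher\[(?P<role>\w+)\]\] "
    r"org\.apache\.druid\.discovery\.BaseNodeRoleWatcher - "
    r"Node\[(?P<node>[^\]]+)\] of role\[\w+\] (?P<event>detected|went offline)\."
)
TS_FORMAT = "%Y-%m-%dT%H:%M:%S,%f"


def uptime_intervals(lines):
    """Return (node, detected_at, offline_at, seconds_online) for each detected -> offline pair."""
    detected = {}
    intervals = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip unrelated log lines
        ts = datetime.strptime(m["ts"], TS_FORMAT)
        node = m["node"]
        if m["event"] == "detected":
            detected[node] = ts
        elif node in detected:
            # Close the open "online" window for this node.
            start = detected.pop(node)
            intervals.append((node, start, ts, (ts - start).total_seconds()))
    return intervals


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python broker_flaps.py router.log
    with open(sys.argv[1], encoding="utf-8") as f:
        for node, start, end, secs in uptime_intervals(f):
            print(f"{node} online {secs:10.3f}s ({start} -> {end})")
```

On the excerpt above, the last pair (07:32:42,372 -> 07:32:42,382) shows the broker dropping out only 10 ms after being detected, which points at the discovery layer rather than at a slow broker startup.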

How can we investigate this disconnection between the Router and the Brokers further?

Is it safe to presume you’re not using ZooKeeper here?

Sorry, that answer was a fragment, hahaha.

If you are, maybe check your ZK networking stuff.

Another area I might investigate is your Jetty thread capacity, in case you’re simply running out of HTTP threads for communication between servers.

No, we do use it… but if I understood your answer correctly, the root cause is probably there :wink:

Yes, and indeed we see logs like this in ZK:

2021-12-13 10:57:26,559 [myid:1] - WARN  [NIOWorkerThread-1:NIOServerCnxn@366] - Unable to read additional data from client sessionid 0x10497351be90006, likely client has closed socket
2021-12-13 11:01:46,036 [myid:1] - WARN  [NIOWorkerThread-1:NIOServerCnxn@366] - Unable to read additional data from client sessionid 0x10497351be90007, likely client has closed socket
2021-12-13 11:05:58,261 [myid:1] - WARN  [NIOWorkerThread-2:NIOServerCnxn@366] - Unable to read additional data from client sessionid 0x104ba324dda0003, likely client has closed socket
2021-12-13 12:01:47,246 [myid:1] - WARN  [NIOWorkerThread-1:NIOServerCnxn@366] - Unable to read additional data from client sessionid 0x104ba324dda0003, likely client has closed socket
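A similar Python sketch, assuming the ZooKeeper log format shown above, groups these warnings by session id so you can see which client sessions drop repeatedly. In this excerpt session 0x104ba324dda0003 appears twice, which may suggest the same client reconnected and resumed its session before it expired. Mapping a session id back to a specific Druid process is not covered here.

```python
import re
import sys
from collections import defaultdict

# Matches the ZooKeeper NIOServerCnxn warning format shown in the excerpt above (assumed format).
ZK_WARN_RE = re.compile(
    r"^(?P<ts>\S+ \S+) \[myid:(?P<myid>\d+)\] - WARN\s+\[[^\]]+\] - "
    r"Unable to read additional data from client sessionid (?P<session>0x[0-9a-f]+)"
)


def closed_sockets_by_session(lines):
    """Map each ZK session id to the timestamps at which its client socket was closed."""
    by_session = defaultdict(list)
    for line in lines:
        m = ZK_WARN_RE.match(line.strip())
        if m:
            by_session[m["session"]].append(m["ts"])
    return dict(by_session)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python zk_sessions.py zookeeper.log
    with open(sys.argv[1], encoding="utf-8") as f:
        for session, stamps in sorted(closed_sockets_by_session(f).items()):
            print(f"{session}: {len(stamps)} close(s) at {', '.join(stamps)}")
```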

You might find some useful stuff in here. Maybe try increasing the timeout?


/me stabs in the dark like a mad person