Historical nodes failing to connect to ZooKeeper once a day

Hi all,

we have been having serious trouble with Druid since last week. Every morning our historical nodes start failing to connect to ZK, and we see errors like this:

2016-12-05 07:52:24,378 ERROR(CuratorFrameworkImpl.java:566): Background operation retry gave up
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127) ~[zookeeper-3.4.8.jar:3.4.8-1]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:728) [curator-framework-2.10.0.jar:?]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:516) [curator-framework-2.10.0.jar:?]
    at org.apache.curator.framework.imps.CreateBuilderImpl.sendBackgroundResponse(CreateBuilderImpl.java:565) [curator-framework-2.10.0.jar:?]
    at org.apache.curator.framework.imps.CreateBuilderImpl.access$900(CreateBuilderImpl.java:44) [curator-framework-2.10.0.jar:?]
    at org.apache.curator.framework.imps.CreateBuilderImpl$6.processResult(CreateBuilderImpl.java:524) [curator-framework-2.10.0.jar:?]
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:613) [zookeeper-3.4.8.jar:3.4.8-1]
    at org.apache.zookeeper.ClientCnxn$EventThread.queuePacket(ClientCnxn.java:485) [zookeeper-3.4.8.jar:3.4.8-1]
    at org.apache.zookeeper.ClientCnxn.finishPacket(ClientCnxn.java:655) [zookeeper-3.4.8.jar:3.4.8-1]
    at org.apache.zookeeper.ClientCnxn.conLossPacket(ClientCnxn.java:673) [zookeeper-3.4.8.jar:3.4.8-1]
    at org.apache.zookeeper.ClientCnxn.access$2300(ClientCnxn.java:90) [zookeeper-3.4.8.jar:3.4.8-1]
    at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:1255) [zookeeper-3.4.8.jar:3.4.8-1]
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1170) [zookeeper-3.4.8.jar:3.4.8-1]

We have to restart them and also delete the segment-cache folder, because if we don’t delete it, they keep failing. While this is happening we cannot query them, and the realtime tasks (Kafka indexing service) can’t persist their data.

Unfortunately we have far too many segments (thousands) per dataSource. We are trying to reduce this number by re-indexing with DAY granularity, but there is still a long way to go, and we cannot run re-indexing tasks while the historicals are having so much trouble.
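For context, this kind of re-indexing can be sketched with the native index task and the ingestSegment firehose. The dataSource name and interval below are placeholders, and the parser, dimensions and metricsSpec are trimmed out (they have to match the existing schema), so this is only a rough outline, not a complete spec:

{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "your_datasource",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2016-11-01/2016-12-01"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "ingestSegment",
        "dataSource": "your_datasource",
        "interval": "2016-11-01/2016-12-01"
      }
    }
  }
}

The idea is simply that rolling hourly (or finer) segments up into DAY segments cuts the number of segments the historicals have to announce and serve.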

Also, this morning we had to restart ZooKeeper because the Kafka indexing tasks weren’t able to start properly: they started and were immediately killed by the overlord, and we are not sure why.

In zookeeper.out we see errors like this:

2016-12-05 11:29:59,192 [myid:3] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x0, likely client has closed socket
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:230)
    at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
    at java.lang.Thread.run(Thread.java:745)

Is this only a ZK problem? What can we change? We run ZooKeeper on 3 nodes: one of them also hosts a Druid broker, and the other two each host a Druid overlord and a coordinator.

Thank you, any help will be appreciated.

Using: druid 0.9.1.1, zookeeper 3.4.8

Fede,

We seem to have solved this.

We did two things:

  • Changed the coordinator’s _default rule to replication factor 1; with replication 2 the cluster was sending too many requests to ZK (see the rule sketch below).

  • Set used=0 for the corrupted segments in the metadata DB; the coordinator kept trying to load them, and the historicals failed every time (see the SQL sketch below).

Decreasing the number of ZK requests with these two changes made things stable.
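In case it helps someone, here is roughly what the two changes look like. The rule payload is a sketch of what gets POSTed to the coordinator at /druid/coordinator/v1/rules/_default (assuming the default rules entry is named _default and the default tier is _default_tier):

[
  {
    "type": "loadForever",
    "tieredReplicants": { "_default_tier": 1 }
  }
]

The used flag lives in the druid_segments table of the metadata store (druid_segments is the default name; it depends on your druid.metadata.storage.tables.base setting). The segment id below is a placeholder, and on PostgreSQL the column is a boolean, so you would use used = false instead:

UPDATE druid_segments SET used = 0 WHERE id = '<segment_id_of_corrupted_segment>';

Once a segment is marked unused, the coordinator stops asking the historicals to load it.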

Hi,

I’m also facing the same issue.

Would you please share the steps you took to resolve it?

Thanks,
Rajesh

Hey Rajesh,

This issue seems to be caused by heavy load on ZK: the load destabilizes ZK, and that instability spills over into Druid. You should be able to fix it by reducing the load on ZK or by giving ZK more powerful hardware.
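For example, the first knobs I would check are the session-timeout bounds on the ZK side and the matching client-side timeout in Druid. The values below are only illustrative, not recommendations:

# zoo.cfg (illustrative values)
tickTime=2000
# defaults are 2x and 20x tickTime; clients negotiate a session timeout inside this range
minSessionTimeout=4000
maxSessionTimeout=60000
# raise if many Druid processes connect from the same host (ZK default is 60)
maxClientCnxns=100

# Druid common.runtime.properties (illustrative values)
druid.zk.service.host=zk1:2181,zk2:2181,zk3:2181
# must fit inside ZK's min/max session timeout range; Druid's default is 30000
druid.zk.service.sessionTimeoutMs=30000

Beyond that, it mostly comes down to giving ZK dedicated disk and CPU and keeping the segment count under control, as Fede described.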