Zookeeper CancelledKeyException while indexing

Hey,

I’m using the latest implydata quickstart setup to prototype new things in a staging environment. I used to submit 4 Hadoop indexing tasks in parallel on an m4.2xlarge instance with S3 storage without any problems, but now if I submit just 2 tasks concurrently on identical data I keep getting these ZooKeeper exceptions. The only warning sign is that iowait rises sharply, even though the tasks barely touch the disks and bandwidth is at 50%:

java.nio.channels.CancelledKeyException
at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151)
at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1082)
at org.apache.zookeeper.server.NIOServerCnxn.process(NIOServerCnxn.java:1119)
at org.apache.zookeeper.server.WatchManager.triggerWatch(WatchManager.java:120)
at org.apache.zookeeper.server.WatchManager.triggerWatch(WatchManager.java:92)
at org.apache.zookeeper.server.DataTree.setData(DataTree.java:620)
at org.apache.zookeeper.server.DataTree.processTxn(DataTree.java:807)
at org.apache.zookeeper.server.ZKDatabase.processTxn(ZKDatabase.java:329)
at org.apache.zookeeper.server.ZooKeeperServer.processTxn(ZooKeeperServer.java:1026)
at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:116)
at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:200)
at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:131)

From that point on, all the indexing tasks start failing without any error or exception in the task logs or in any of Druid’s logs.

I have been digging into it for a few hours, but no change seems to have any effect. Any ideas?

Hi Jakub, where do you see this error?

Are there no errors in the Druid overlord or task logs?

Hi Fangjin,

I’m using the implydata quickstart setup, which runs everything in a single container. This exception comes from the container’s stdout; all logs are forwarded there, including ZooKeeper’s.

As I mentioned, there are no additional errors anywhere, not even in the task logs. But every time I saw this exception, the task that was running at the time ended up with FAILED status…

However, it stopped happening later. I guess it could have been caused by ingesting some “heavier” segments. Later I was able to run 4 tasks concurrently without a problem…

So it’s ok now.

The imply quickstart should have dedicated logs for every single Druid process. These logs are stored as files in directories. For example, if you have a task fail, can you access the overlord console located at http://localhost:8090/console.html, click on the task log, and include that here?
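If the console is awkward to reach from inside the container, the same task log can probably be pulled over HTTP. Below is a rough sketch, assuming the overlord listens on localhost:8090 as above and exposes the task log at /druid/indexer/v1/task/{taskId}/log. The helper names task_log_url and fetch_task_log are made up for illustration, and the path may differ in your Druid version.

```python
import urllib.request
from urllib.parse import quote

# Assumption: overlord address taken from the console URL above; adjust if yours differs.
OVERLORD = "http://localhost:8090"


def task_log_url(task_id, base=OVERLORD):
    """Build the overlord URL assumed to serve the log of one indexing task."""
    # Task ids usually contain timestamps with ':' characters, so percent-encode them.
    return f"{base.rstrip('/')}/druid/indexer/v1/task/{quote(task_id, safe='')}/log"


def fetch_task_log(task_id, base=OVERLORD, timeout=10):
    """Download the task log as text (hypothetical helper; needs a running overlord)."""
    with urllib.request.urlopen(task_log_url(task_id, base), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling fetch_task_log with the id of one of the FAILED tasks (copied from the console's task list) should return the text you can paste here.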