Peon tasks stop receiving data after some time. ZK issue?

Hi,
In my current setup, there are times when no data is indexed by the middle managers for a full hour, and indexing suddenly resumes in the next hour when new tasks are spawned. When this happens is very erratic, which makes it hard to figure out what is going on. During the hour in which no data is sent, the Tranquility sender prints these messages continuously:

228746 [ClusteredBeam-ZkFuturePool-14d725f8-5834-4dd4-a9dc-4721a2d96954] INFO com.metamx.tranquility.beam.ClusteredBeam - Global latestCloseTime[2016-02-08T02:00:00.000-07:00] for identifier[overlord/vidmessage] has moved past timestamp[2016-02-08T02:00:00.000-07:00], not creating merged beam

228748 [ClusteredBeam-ZkFuturePool-14d725f8-5834-4dd4-a9dc-4721a2d96954] INFO com.metamx.tranquility.beam.ClusteredBeam - Turns out we decided not to actually make beams for identifier[overlord/vidmessage] timestamp[2016-02-08T02:00:00.000-07:00]. Returning None.

228791 [ClusteredBeam-ZkFuturePool-14d725f8-5834-4dd4-a9dc-4721a2d96954] INFO com.metamx.tranquility.beam.ClusteredBeam - Global latestCloseTime[2016-02-08T02:00:00.000-07:00] for identifier[overlord/vidmessage] has moved past timestamp[2016-02-08T02:00:00.000-07:00], not creating merged beam

228792 [ClusteredBeam-ZkFuturePool-14d725f8-5834-4dd4-a9dc-4721a2d96954] INFO com.metamx.tranquility.beam.ClusteredBeam - Turns out we decided not to actually make beams for identifier[overlord/vidmessage] timestamp[2016-02-08T02:00:00.000-07:00]. Returning None.

In this particular scenario, indexing stopped 20 minutes into the hour. The middle manager log for one of the tasks shows that it persists segments every 100,000 rows, but after the 20th minute I just see a lot of "server disappeared" messages. The full log is attached to the thread. What could be the problem? Does it have something to do with ZooKeeper?
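For reference, my (possibly wrong) understanding of the rule Tranquility applies before handing an event to a task is sketched below. This is not the actual ClusteredBeam code, and the window value is only an example, not my real tuning; it just illustrates why, once the beam for an hour is considered closed, the rest of that hour's events get dropped on the sender side instead of erroring out.

import org.joda.time.{DateTime, Period}

// Sketch only, not Tranquility's ClusteredBeam implementation: an event is
// accepted only if its timestamp lies within windowPeriod of the current time,
// and once latestCloseTime has reached an hour bucket, no new beam is created
// for that bucket, so later events for it are silently discarded.
object WindowCheck {
  def accepts(eventTime: DateTime, now: DateTime, windowPeriod: Period): Boolean =
    !eventTime.isBefore(now.minus(windowPeriod)) &&
      !eventTime.isAfter(now.plus(windowPeriod))
}

For example, with a hypothetical windowPeriod of PT10M, an event stamped 02:25 that reaches the sender at 02:50 would be rejected. In my case the event timestamps track wall-clock time, so it looks less like a window problem and more like the 02:00 beam being marked closed early and never recreated, which would match the "not creating merged beam" messages above.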

Thanks,

Ram

task payload.json (1.75 KB)

peon logs.log (1.07 MB)

The Druid version is 0.8.2 and the Tranquility version is 0.4.2.

I have already gone through this post https://groups.google.com/forum/#!topic/druid-development/t2AYsr3ZfX8, which suggests running with 2 replicas. But in my case the job isn't defunct; it succeeds, yet the tasks stop receiving anything, and Tranquility decides not to create new tasks even though data was flowing to them a few seconds earlier.
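For completeness, here is roughly where those 2 replicas would be configured in the Tranquility tuning. This is only a sketch based on the builder-style API documented for newer Tranquility releases; the method names may differ in 0.4.2, and the granularity and window values are placeholders, not my actual config.

import com.metamx.common.Granularity
import com.metamx.tranquility.beam.ClusteredBeamTuning
import org.joda.time.Period

// Sketch: with replicants = 2, each partition is fed to two peon tasks, so a
// single task dying or becoming unreachable should not silently drop the
// remainder of the hour's data.
val tuning = ClusteredBeamTuning
  .builder()
  .segmentGranularity(Granularity.HOUR)
  .windowPeriod(new Period("PT10M"))
  .partitions(1)
  .replicants(2)
  .build()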

For anyone who is facing a similar problem: I fixed this by using a better overlord node and upgrading Tranquility to 0.7.2. The root cause was mostly that the overlord was a very low-powered machine, which caused a lot of connection resets everywhere.

Glad you figured it out :)

Hi Fangjin,

I’m experiencing the same issue, and it sometimes happens unexpectedly. What could be the root cause of this error?