Still waiting for Handoff for Segments

Hello,

I have upgraded Druid from 0.9.2 to 0.10.0. After the upgrade, the realtime index task runs forever and never finishes. The peon log says:


2017-04-30T04:42:58,043 INFO [coordinator_handoff_scheduled_0] com.metamx.http.client.pool.ChannelResourceFactory - Generating: http://druid-prod01:8081

2017-04-30T04:42:58,054 INFO [coordinator_handoff_scheduled_0] io.druid.segment.realtime.plumber.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for Segments : [[SegmentDescriptor{interval=2017-04-30T02:00:00.000Z/2017-04-30T03:00:00.000Z, version='2017-04-30T02:41:58.367Z', partitionNumber=0}]]

I have checked the metadata and found the segment in druid_segments; this is the payload:


{
  "dataSource": "TraceLog",
  "interval": "2017-04-30T02:00:00.000Z/2017-04-30T03:00:00.000Z",
  "version": "2017-04-30T02:41:58.367Z",
  "loadSpec": {
    "type": "hdfs",
    "path": "hdfs://HAservice/druid-prod/segments/TraceLog/20170430T020000.000Z_20170430T030000.000Z/2017-04-30T02_41_58.367Z/0_index.zip"
  },
  "dimensions": "hostname",
  "metrics": "count,cost",
  "shardSpec": {
    "type": "linear",
    "partitionNum": 0
  },
  "binaryVersion": 9,
  "size": 24614914,
  "identifier": "TraceLog_2017-04-30T02:00:00.000Z_2017-04-30T03:00:00.000Z_2017-04-30T02:41:58.367Z"
}
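
One more thing that may be worth confirming in the metadata store is that this row is marked used; if used is 0, the coordinator will never try to load the segment and handoff will never complete. A minimal sketch of that check, assuming a MySQL metadata store and the pymysql driver (host, user, and password below are placeholders for your setup):

import json
import pymysql  # assumption: MySQL metadata store with pymysql installed

SEGMENT_ID = ("TraceLog_2017-04-30T02:00:00.000Z_2017-04-30T03:00:00.000Z"
              "_2017-04-30T02:41:58.367Z")

# Placeholder connection details; point these at your metadata store.
conn = pymysql.connect(host="metadata-db-host", user="druid",
                       password="druid", database="druid")
try:
    with conn.cursor() as cur:
        cur.execute("SELECT used, payload FROM druid_segments WHERE id = %s",
                    (SEGMENT_ID,))
        row = cur.fetchone()
        if row is None:
            print("segment row not found in druid_segments")
        else:
            used, payload = row
            # If used is 0 the coordinator will never assign this segment.
            print("used =", used)
            if isinstance(payload, (bytes, bytearray)):
                payload = payload.decode("utf-8")
            # The payload should contain the full loadSpec shown above.
            print(json.dumps(json.loads(payload), indent=2))
finally:
    conn.close()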

I checked the files in HDFS; these files:


/druid-prod/segments/TraceLog/20170430T020000.000Z_20170430T030000.000Z/2017-04-30T02_41_58.367Z/0_index.zip

/druid-prod/segments/TraceLog/20170430T020000.000Z_20170430T030000.000Z/2017-04-30T02_41_58.367Z/0_descriptor.json

were uploaded correctly.

In the coordinator console, I found the segment listed under the datasource; this is the metadata I get from the coordinator's console:


{
  "metadata": {
    "dataSource": "TraceLog",
    "interval": "2017-04-30T02:00:00.000Z/2017-04-30T03:00:00.000Z",
    "version": "2017-04-30T02:41:58.367Z",
    "loadSpec": {},
    "dimensions": "",
    "metrics": "count,cost",
    "shardSpec": {
      "type": "linear",
      "partitionNum": 0
    },
    "binaryVersion": null,
    "size": 0,
    "identifier": "TraceLog_2017-04-30T02:00:00.000Z_2017-04-30T03:00:00.000Z_2017-04-30T02:41:58.367Z"
  },
  "servers": [
    "druid-prod03:8103"
  ]
}

It is very strange that loadSpec and dimensions are empty in the above.
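
To compare what the coordinator has polled from the metadata store against what the console shows, one possible check is the coordinator's metadata API. A rough sketch, assuming the coordinator from the peon log above (druid-prod01:8081) and that your version has this documented endpoint:

import json
import urllib.request

# Coordinator host taken from the peon log at the top; adjust if needed.
COORDINATOR = "http://druid-prod01:8081"
DATASOURCE = "TraceLog"
SEGMENT_ID = ("TraceLog_2017-04-30T02:00:00.000Z_2017-04-30T03:00:00.000Z"
              "_2017-04-30T02:41:58.367Z")

# Metadata-store view of the segment as polled by the coordinator.
url = ("%s/druid/coordinator/v1/metadata/datasources/%s/segments/%s"
       % (COORDINATOR, DATASOURCE, SEGMENT_ID))
with urllib.request.urlopen(url) as resp:
    segment = json.loads(resp.read().decode("utf-8"))

# If loadSpec is empty here as well, the problem is on the metadata/polling
# side rather than on the historicals.
print(json.dumps(segment, indent=2))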

I have checked all logs on the coordinator and historical nodes; there are no ERRORs or exceptions at that time, and no logs about the segment. The only clue is the message in the peon log: "Still waiting for Handoff for Segments".

How can I fix it? Thanks very much.

My guess is that this segment is assigned to a historical whose load queue is always full due to a disk failure (or some other reason); you can check all historical logs to make sure this is not the case.
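
If you want to check the load queues without digging through every historical log, here is a small sketch against the coordinator's load queue API (the coordinator host is an assumption taken from the peon log earlier in the thread):

import json
import urllib.request

# Coordinator host assumed from the peon log above; adjust to your setup.
COORDINATOR = "http://druid-prod01:8081"

# Per-historical counts and sizes of segments waiting to be loaded or dropped;
# a queue that never drains would point at the disk-full / failed-node case.
url = COORDINATOR + "/druid/coordinator/v1/loadqueue?simple"
with urllib.request.urlopen(url) as resp:
    queues = json.loads(resp.read().decode("utf-8"))

for server, queue in sorted(queues.items()):
    print(server, queue)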

On Sunday, April 30, 2017 at 1:06:18 PM UTC+8, Leon wrote:

Hi, thanks for your reply

I have checked all my historical logs and coordinator logs; no error or exception was found, and there are not even any logs about the task.

And the disk on each historical node still has 40% of its capacity available.

The situation does not always happen.

On Tuesday, May 2, 2017 at 11:53:00 AM UTC+8, 丁凯剑 wrote:

Hi 丁凯剑

Is it possible that the segment is too big? I found that in the failed cases the segments are very big, more than 4 GB in size (but not all of them failed; some segments over 4 GB succeeded).

And the failed segments print logs like:

2017-05-07T01:20:00,008 INFO [TraceLog-overseer-0] io.druid.segment.realtime.plumber.RealtimePlumber - Found [1] segments. Attempting to hand off segments that start before [1970-01-01T00:00:00.000Z].
2017-05-07T01:20:00,008 INFO [TraceLog-overseer-0] io.druid.segment.realtime.plumber.RealtimePlumber - Found [0] sinks to persist and merge

Is '1970-01-01T00:00:00.000Z' an error clue?

Thank you for any suggestion.

On Tuesday, May 2, 2017 at 11:53:00 AM UTC+8, 丁凯剑 wrote:

Hi, this can be caused by multiple issues.

Check that the segment is persisted in HDFS.

Check that the metadata storage segment table contains the right load spec.

Check that the overlord knows about the segment and is trying to assign it to a historical.

Check that the historical and overlord are well connected via ZooKeeper; sometimes the quorum is broken.

Posting the logs of the historical, overlord, and realtime nodes will help in debugging this; a rough sketch of the coordinator-side checks is below.
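
In case it helps, here is a rough sketch of the coordinator-side checks from the list above. Handoff only completes once the coordinator reports the segment as served by a historical; the host, datasource, and segment id below are assumptions taken from earlier in this thread, so adjust them to your setup:

import json
import urllib.request

# Host, datasource and segment id are assumptions based on the thread above.
COORDINATOR = "http://druid-prod01:8081"
DATASOURCE = "TraceLog"
SEGMENT_ID = ("TraceLog_2017-04-30T02:00:00.000Z_2017-04-30T03:00:00.000Z"
              "_2017-04-30T02:41:58.367Z")

def get_json(path):
    with urllib.request.urlopen(COORDINATOR + path) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Fraction of used segments loaded on historicals, per datasource.
print("loadstatus:", json.dumps(get_json("/druid/coordinator/v1/loadstatus")))

# Served view of the segment (metadata plus the serving servers), the same
# information shown from the coordinator console earlier in the thread.
print("segment:", json.dumps(get_json(
    "/druid/coordinator/v1/datasources/%s/segments/%s?full"
    % (DATASOURCE, SEGMENT_ID)), indent=2))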

Hello, I have the same problem. Have you solved it?

On Sunday, April 30, 2017 at 1:06:18 PM UTC+8, Leon wrote: