AWS - "Still waiting for Handoff for Segments" messages - tasks are running forever

I am using the Tranquility API to push events to Druid. Attached is the sample spec [stats.json] I was using for the task.

Here are the key aspects of the segment configuration -

"segmentGranularity" : "FIFTEEN_MINUTE",

"intermediatePersistPeriod" : "PT10M",

"windowPeriod" : "PT10M"
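For context, here is a minimal sketch of where those three settings sit in a Tranquility/realtime spec. The surrounding structure and the dataSource name are illustrative, not copied from my actual stats.json:

```json
{
  "dataSchema": {
    "dataSource": "AppResourceStats",
    "granularitySpec": {
      "segmentGranularity": "FIFTEEN_MINUTE"
    }
  },
  "tuningConfig": {
    "type": "realtime",
    "intermediatePersistPeriod": "PT10M",
    "windowPeriod": "PT10M"
  }
}
```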

I see segments are created at 15-minute intervals but are never completed. They have been in RUNNING status for over 6 hours now. In the task log, I can see that the segment is copied to S3 successfully.

2016-05-10T19:45:08,583 INFO [AppResourceStats-2016-05-10T19:15:00.000Z-persist-n-merge] io.druid.segment.realtime.plumber.RealtimePlumber - Pushing [AppResourceStats_2016-05-10T19:15:00.000Z_2016-05-10T19:30:00.000Z_2016-05-10T19:18:21.640Z] to deep storage


And I verified the segment in S3.

Then I see communication happening to Overlord -

2016-05-10T19:45:11,333 INFO [AppResourceStats-2016-05-10T19:15:00.000Z-persist-n-merge] io.druid.indexing.common.actions.RemoteTaskActionClient - Submitting action for task[index_realtime_AppResourceStats_2016-05-10T19:15:00.000Z_0_0] to overlord[http://druidmaster:8090/druid/indexer/v1/action]: SegmentInsertAction{segments=[DataSegment{size=7968910, shardSpec=LinearShardSpec{partitionNum=0}, metrics=[count, FREE, USED, MAX_ALLOWED], dimensions=[SERVER_NAME, APPLICATION_NAME, RESOURCE_NAME, RESOURCE_TYPE, PENDING], version='2016-05-10T19:18:21.640Z', loadSpec={type=s3_zip, bucket=pclndruid, key=druid/segments/AppResourceStats/2016-05-10T19:15:00.000Z_2016-05-10T19:30:00.000Z/2016-05-10T19:18:21.640Z/0/}, interval=2016-05-10T19:15:00.000Z/2016-05-10T19:30:00.000Z, dataSource='AppResourceStats', binaryVersion='9'}]}


After this point, all I see is -

2016-05-10T19:45:41,071 INFO [coordinator_handoff_scheduled_0] io.druid.segment.realtime.plumber.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for Segments : [[SegmentDescriptor{interval=2016-05-10T19:15:00.000Z/2016-05-10T19:30:00.000Z, version='2016-05-10T19:18:21.640Z', partitionNumber=0}]]


I am running the Historical and MiddleManager nodes on separate EC2 instances - 8 cores, 61 GB RAM, and 160 GB SSD each.

Here is my historical file -


HTTP server threads


Processing threads and buffers


Segment storage


Query cache

#druid.monitoring.monitors=["io.druid.server.metrics.HistoricalMetricsMonitor", "com.metamx.metrics.JvmMonitor"]



Any thoughts on why the tasks are not going to SUCCESS state?

stats.json (1.89 KB)

Hey Jagadeesh,

Anything interesting in your coordinator and historical logs? The coordinator is the process that detects the new segment built by the indexing task and signals the historical nodes to load it. The indexing task will only complete once it gets notification that a historical has picked up the segment, so it knows it can stop serving it. The coordinator logs should help determine whether the coordinator noticed the new segment, whether it tried to signal a historical to load it but failed, whether rules prevented it from loading, etc. The historical logs would show whether a historical received the load order but failed for some reason (e.g. out of memory).
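To make the handoff check concrete: the "Still waiting for Handoff" message is emitted while the task polls the coordinator and fails to find a served segment matching its descriptor by interval, version, and partition number. Here is a rough Python sketch of that matching logic; the dict shapes are my own simplification, not Druid's actual API types:

```python
def is_handed_off(served_segments, descriptor):
    """Return True if a segment matching the descriptor's interval,
    version, and partition number is already served by a historical."""
    for seg in served_segments:
        if (seg["interval"] == descriptor["interval"]
                and seg["version"] == descriptor["version"]
                and seg["partitionNumber"] == descriptor["partitionNumber"]):
            return True
    return False

# Descriptor taken from the log line in this thread
waiting = {
    "interval": "2016-05-10T19:15:00.000Z/2016-05-10T19:30:00.000Z",
    "version": "2016-05-10T19:18:21.640Z",
    "partitionNumber": 0,
}

# Until a historical reports the segment as loaded, the task keeps waiting
print(is_handed_off([], waiting))         # False -> "Still waiting for Handoff"
print(is_handed_off([waiting], waiting))  # True  -> task can finish
```

So if no historical ever loads the segment (load rules, capacity, or connectivity problems), this check never succeeds and the task stays in RUNNING forever.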

Thanks for explaining the internal processing David.

I don't see any memory issues, and I do see the handoff on the historical nodes. When I stopped the Druid nodes, the log files were flooded with connectivity issues. I cleaned up my metadata and started testing again. Will keep this thread posted. Thanks

I have been running for the past 12 hours, consuming from a single stream. Around 100M rows were inserted into Druid with segments created every 15 mins; no issues so far. Not sure what caused it earlier.

Hi, I think your segment may be bigger than 2 GB; try setting the task.partitions property.
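If it helps, in Tranquility the partition and replica counts can be set in the properties block of the spec; a sketch with illustrative values:

```json
{
  "properties": {
    "task.partitions": "2",
    "task.replicants": "1"
  }
}
```

More partitions split each interval into smaller shards, keeping individual segments well under the size limit.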

Sure, will try that.