AWS - "Still waiting for Handoff for Segments" Messages - Tasks are running for ever

I am using the Tranquility API to push events to Druid. Attached is a sample spec [stats.json] that I am using for the task.

Here are the key aspects of the segment configuration -

"segmentGranularity" : "FIFTEEN_MINUTE",
"intermediatePersistPeriod" : "PT10M",
"windowPeriod" : "PT10M"
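
For context, here is a minimal sketch of roughly where these settings sit in a Tranquility/realtime spec. This is only a sketch: the granularitySpec type, queryGranularity, and maxRowsInMemory values below are illustrative, not taken from the attached stats.json.

```json
{
  "dataSchema": {
    "dataSource": "AppResourceStats",
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "FIFTEEN_MINUTE",
      "queryGranularity": "none"
    }
  },
  "tuningConfig": {
    "type": "realtime",
    "windowPeriod": "PT10M",
    "intermediatePersistPeriod": "PT10M",
    "maxRowsInMemory": 75000
  }
}
```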

I see segments being created at 15-minute intervals, but the tasks never complete. They have been in RUNNING status for over 6 hours now. In the task log, I can see that the segment is pushed to S3 successfully -

2016-05-10T19:45:08,583 INFO [AppResourceStats-2016-05-10T19:15:00.000Z-persist-n-merge] io.druid.segment.realtime.plumber.RealtimePlumber - Pushing [AppResourceStats_2016-05-10T19:15:00.000Z_2016-05-10T19:30:00.000Z_2016-05-10T19:18:21.640Z] to deep storage


And I verified the segment in S3.

Then I see communication with the Overlord -

2016-05-10T19:45:11,333 INFO [AppResourceStats-2016-05-10T19:15:00.000Z-persist-n-merge] io.druid.indexing.common.actions.RemoteTaskActionClient - Submitting action for task[index_realtime_AppResourceStats_2016-05-10T19:15:00.000Z_0_0] to overlord[http://druidmaster:8090/druid/indexer/v1/action]: SegmentInsertAction{segments=[DataSegment{size=7968910, shardSpec=LinearShardSpec{partitionNum=0}, metrics=[count, FREE, USED, MAX_ALLOWED], dimensions=[SERVER_NAME, APPLICATION_NAME, RESOURCE_NAME, RESOURCE_TYPE, PENDING], version='2016-05-10T19:18:21.640Z', loadSpec={type=s3_zip, bucket=pclndruid, key=druid/segments/AppResourceStats/2016-05-10T19:15:00.000Z_2016-05-10T19:30:00.000Z/2016-05-10T19:18:21.640Z/0/index.zip}, interval=2016-05-10T19:15:00.000Z/2016-05-10T19:30:00.000Z, dataSource='AppResourceStats', binaryVersion='9'}]}


After this point, all I see is -

2016-05-10T19:45:41,071 INFO [coordinator_handoff_scheduled_0] io.druid.segment.realtime.plumber.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for Segments : [[SegmentDescriptor{interval=2016-05-10T19:15:00.000Z/2016-05-10T19:30:00.000Z, version='2016-05-10T19:18:21.640Z', partitionNumber=0}]]


I am running the Historical and MiddleManager nodes on separate EC2 instances - 8 cores, 61 GB RAM, and a 160 GB SSD each.

Here is my historical runtime.properties file -

druid.service=druid/historical
druid.port=8083

# HTTP server threads
druid.server.http.numThreads=40

# Processing threads and buffers
druid.processing.buffer.sizeBytes=1073741824
druid.processing.numThreads=7

# Segment storage
druid.segmentCache.locations=[{"path":"var/druid/segment-cache","maxSize":130000000000}]
druid.server.maxSize=130000000000

# Query cache
druid.broker.cache.useCache=true
druid.broker.cache.populateCache=true
druid.cache.type=local
druid.cache.sizeInBytes=2000000000

#druid.monitoring.monitors=["io.druid.server.metrics.HistoricalMetricsMonitor", "com.metamx.metrics.JvmMonitor"]

druid.historical.cache.useCache=true
druid.historical.cache.populateCache=true


Any thoughts on why the tasks never reach the SUCCESS state?

stats.json (1.89 KB)

Hey Jagadeesh,

Anything interesting in your coordinator and historical logs? The coordinator is the process that detects the new segment built by the indexing task and signals the historical nodes to load it. The indexing task only completes once it gets notification that a historical has picked up the segment, so that it knows it can stop serving it. The coordinator logs should help determine whether the coordinator noticed the new segment, whether it tried to signal a historical to load it but failed, whether there were rules preventing it from loading, etc. The historical logs would show whether a historical received the load order but failed for some reason (e.g. ran out of memory).
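
For reference, load rules are one common culprit: if the retention rules for the datasource (or the cluster defaults) do not cover the new interval, the coordinator never asks a historical to load the segment and the task waits on handoff indefinitely. A permissive default rule set, as configured through the coordinator, looks roughly like the sketch below; the replicant count is just an example.

```json
[
  {
    "type": "loadForever",
    "tieredReplicants": {
      "_default_tier": 1
    }
  }
]
```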

Thanks for explaining the internal processing, David.

I don’t see any memory issues, and I do see the handoff happening on the historical nodes. When I stopped the Druid nodes, the log files were flooded with connectivity issues. I cleaned up my metadata and started testing again. Will keep this thread posted. Thanks.

I have been running for the past 12 hours, consuming from a single stream. Around 100M rows were inserted into Druid, with segments created every 15 minutes, and no issues so far. Not sure what caused the earlier problem.

Hi, I think your segments may be bigger than 2 GB; try setting the task.partitions property.
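
If you are using Tranquility's configuration file, that property goes in the properties block of the datasource spec; a rough sketch, assuming Tranquility Server-style configuration, with illustrative values:

```json
"properties": {
  "task.partitions": "2",
  "task.replicants": "1"
}
```

Splitting each interval across multiple partitions keeps individual segments smaller.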

Sure, will try that.