Compaction task "Failed to publish segments"

Hi everyone,

Context: Druid 0.12.0, real-time Kafka ingestion (spec included below).

We have a compaction task that runs every night and fails roughly 10% of the time (about once every ten days). The root cause is a mystery to me because the logs give minimal feedback: the task runs fine and then fails while publishing the segments at the very end of the process.

We recently reset the related supervisor, but the issue persists.
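For completeness, the reset was done through the overlord's supervisor reset endpoint, along these lines (the host, port, and supervisor id below are placeholders for our real ones):

import requests

OVERLORD = "http://overlord-host:8090"  # placeholder for our overlord
SUPERVISOR_ID = "data_source"           # assumed to match the datasource name

# Hard-resets the supervisor's stored Kafka offsets and restarts its tasks.
resp = requests.post(f"{OVERLORD}/druid/indexer/v1/supervisor/{SUPERVISOR_ID}/reset")
resp.raise_for_status()
print(resp.json())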

Logs below. Thank you for your help.

Compaction task log:

2018-06-11T03:03:36,525 INFO [publish-0] io.druid.segment.realtime.appenderator.BaseAppenderatorDriver - Transaction failure while publishing segments, checking if someone else beat us to it.

[…]

2018-06-11T03:03:36,644 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[CompactionTask{id=druid_daily_compaction_data_source_2018-06-10, type=compact, dataSource=data_source}]
java.lang.RuntimeException: java.util.concurrent.ExecutionException: io.druid.java.util.common.ISE: Failed to publish segments[[DataSegment{size=128023261, shardSpec=HashBasedNumberedShardSpec{partitionNum=0, partitions=2, partitionDimensions=}, metrics=[count, pageViewCount, totalTimeSpent, totalVisibleMillis], dimensions=[sessionId, pstuid, psuid, campaignId, utmSource, utmCampaign, utmMedium, entryUrl, entryQs, referrerUrl, deviceName, deviceType, utmTerm, utmContent, advertiserId, adgroupId, adcopyId, publisherId, templateId, positionId, websiteId, clickId], version='2018-06-11T03:00:10.739Z', loadSpec={type=>google, bucket=>pws-druid-prod, path=>segments/data_source/2018-06-10T00:00:00.000Z_2018-06-11T00:00:00.000Z/2018-06-11T03:00:10.739Z/0/index.zip}, interval=2018-06-10T00:00:00.000Z/2018-06-11T00:00:00.000Z, dataSource='data_source', binaryVersion='9'}, DataSegment{size=127990674, shardSpec=HashBasedNumberedShardSpec{partitionNum=1, partitions=2, partitionDimensions=}, metrics=[count, pageViewCount, totalTimeSpent, totalVisibleMillis], dimensions=[sessionId, pstuid, psuid, campaignId, utmSource, utmCampaign, utmMedium, entryUrl, entryQs, referrerUrl, deviceName, deviceType, utmTerm, utmContent, advertiserId, adgroupId, adcopyId, publisherId, templateId, positionId, websiteId, clickId], version='2018-06-11T03:00:10.739Z', loadSpec={type=>google, bucket=>pws-druid-prod, path=>segments/data_source/2018-06-10T00:00:00.000Z_2018-06-11T00:00:00.000Z/2018-06-11T03:00:10.739Z/1/index.zip}, interval=2018-06-10T00:00:00.000Z/2018-06-11T00:00:00.000Z, dataSource='data_source', binaryVersion='9'}]]
    at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.1.jar:?]
    at io.druid.indexing.common.task.IndexTask.generateAndPublishSegments(IndexTask.java:739) ~[druid-indexing-service-0.12.0.jar:0.12.0]
    at io.druid.indexing.common.task.IndexTask.run(IndexTask.java:264) ~[druid-indexing-service-0.12.0.jar:0.12.0]
    at io.druid.indexing.common.task.CompactionTask.run(CompactionTask.java:209) ~[druid-indexing-service-0.12.0.jar:0.12.0]
    at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:444) [druid-indexing-service-0.12.0.jar:0.12.0]
    at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:416) [druid-indexing-service-0.12.0.jar:0.12.0]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_152]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_152]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_152]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_152]
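That "checking if someone else beat us to it" line suggests the atomic publish lost a race against another writer for the same interval. To see which segments actually ended up committed for the day, something like this against the coordinator API works (host and port are placeholders):

import json
import requests

COORDINATOR = "http://coordinator-host:8081"  # placeholder for our coordinator
DATASOURCE = "data_source"
# The coordinator API expects '_' instead of '/' inside interval path segments.
INTERVAL = "2018-06-10T00:00:00.000Z_2018-06-11T00:00:00.000Z"

url = (f"{COORDINATOR}/druid/coordinator/v1/datasources/"
       f"{DATASOURCE}/intervals/{INTERVAL}")
# '?full' asks for full segment metadata (versions, shardSpecs, servers, ...).
resp = requests.get(url, params={"full": ""})
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))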

Coordinator log:

2018-06-11T03:03:37,719 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.RemoteTaskRunner - Worker[druid-middlemanager-01.c.pws-tracking-160715.internal:8091] wrote FAILED status for task [druid_daily_compaction_analytics_sessions_2018-06-10] on [TaskLocation{host='druid-middlemanager-01.c.pws-tracking-160715.internal', port=8105, tlsPort=-1}]

Compaction spec:

{
  "id": "druid_daily_compaction_2018-05-05",
  "type": "compact",
  "dataSource": "data_source",
  "interval": "2018--/2018--",
  "tuningConfig": {
    "type": "index",
    "numShards": 3,
    "forceGuaranteedRollup": true
  }
}
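The nightly run just POSTs that spec to the overlord's task endpoint, roughly like this (the cron wrapper is omitted; host, port, and file name are placeholders):

import json
import requests

OVERLORD = "http://overlord-host:8090"  # placeholder for our overlord

# The compaction spec above, loaded from wherever the nightly job keeps it.
with open("compaction_spec.json") as f:
    spec = json.load(f)

# Submitting a task returns its id, which can then be used to poll status/logs.
resp = requests.post(f"{OVERLORD}/druid/indexer/v1/task", json=spec)
resp.raise_for_status()
print(resp.json())  # e.g. {"task": "druid_daily_compaction_..."}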

Kafka ingestion spec:

{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "data_source",
    "parser": {
      "type": "avro_stream",
      "avroBytesDecoder": {
        "type": "schema_registry",
        "url": "[…]"
      },
      "parseSpec": {
        "format": "avro",
        "timestampSpec": {
          "column": "ts",
          "format": "auto"
        },
        "flattenSpec": {
          "fields": [
            {
              "type": "path",
              "name": "deviceName",
              "expr": ".device..deviceName"
            },
            {
              "type": "path",
              "name": "deviceType",
              "expr": ".device..deviceType"
            }
          ]
        },
        "dimensionsSpec": {
          "dimensions": […],
          "dimensionExclusions": […]
        }
      }
    },
    "metricsSpec": [
      {
        "type": "count",
        "name": "count"
      },
      {
        "type": "doubleSum",
        "name": "totalTimeSpent",
        "fieldName": "totalTimeSpent"
      },
      {
        "type": "doubleSum",
        "name": "pageViewCount",
        "fieldName": "pageViewCount"
      },
      {
        "type": "doubleSum",
        "name": "totalVisibleMillis",
        "fieldName": "totalVisibleMillis"
      }
    ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "DAY",
      "queryGranularity": "HOUR"
    }
  },
  "tuningConfig": {
    "type": "kafka",
    "maxRowsInMemory": 75000,
    "maxRowsPerSegment": 5000000,
    "intermediatePersistPeriod": "PT10M",
    "resetOffsetAutomatically": true
  },
  "ioConfig": {
    "topic": "sessions",
    "replicas": 2,
    "taskCount": 3,
    "taskDuration": "PT30M",
    "consumerProperties": {
      "bootstrap.servers": "[…]",
      "group.id": "[…]",
      "auto.offset.reset": "latest"
    }
  }
}
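In case the supervisor's state on the failing nights is relevant, its status (running tasks, current offsets, lag) can be pulled like this (same placeholder conventions as above):

import json
import requests

OVERLORD = "http://overlord-host:8090"  # placeholder for our overlord
SUPERVISOR_ID = "data_source"           # assumed to match the datasource name

# Reports the supervisor's active tasks and current Kafka offsets.
resp = requests.get(
    f"{OVERLORD}/druid/indexer/v1/supervisor/{SUPERVISOR_ID}/status")
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))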