Periodic problems publishing segments with KIS tasks

Hi!

In our cluster, some indexing tasks fail to publish segments. I found two error messages: one in the task log and one in the overlord log.
The segment dsp_traf_supervisor_2018-10-12T11:15:00.000Z_2018-10-12T11:30:00.000Z_2018-10-12T11:15:00.092Z_12 has a duplicate-key problem in the pendingsegments table.

I have no idea why this has been happening. Could someone help me figure out what's causing it?

Druid version is 0.12.3
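
For reference, the conflicting rows can be inspected directly in the metadata store. This is only a sketch: it assumes the default MySQL metadata store and the default table name druid_pendingSegments, with column names taken from the 0.12 schema, so verify them against your own metadata store before running it:

```sql
-- Sketch: inspect pending segment rows for the affected datasource and
-- interval. Table and column names are assumptions based on the default
-- 0.12 metadata-store schema (prefix druid_).
SELECT id, sequence_name, sequence_prev_id, created_date
FROM druid_pendingSegments
WHERE dataSource = 'dsp_traf_supervisor'
  AND `start` >= '2018-10-12T11:15:00.000Z'
  AND `end` <= '2018-10-12T11:30:00.000Z'
ORDER BY created_date;
```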

**Task log example**

```

2018-10-12T11:18:30,091 INFO [publish-0] io.druid.segment.realtime.appenderator.BaseAppenderatorDriver - Publishing segments with commitMetadata[AppenderatorDriverMetadata{segments={index_kafka_dsp_traf_supervisor_6d81720c3fb7dec_0=[io.druid.segment.realtime.appenderator.SegmentWithState@14d04136, io.druid.segment.realtime.appenderator.SegmentWithState@2b7b26b, io.druid.segment.realtime.appenderator.SegmentWithState@18a57b3c]}, lastSegmentIds={index_kafka_dsp_traf_supervisor_6d81720c3fb7dec_0=dsp_traf_supervisor_2018-10-12T11:15:00.000Z_2018-10-12T11:30:00.000Z_2018-10-12T11:15:00.092Z_12}, callerMetadata={nextPartitions=KafkaPartitions{topic='druid_queue', partitionOffsetMap={12=6430979856}}, publishPartitions=KafkaPartitions{topic='druid_queue', partitionOffsetMap={12=6430979856}}}}]: [DataSegment{size=14241658, shardSpec=NumberedShardSpec{partitionNum=12, partitions=0}, metrics=[metrics], version='2018-10-12T11:15:00.092Z', loadSpec={type=>hdfs, path=>hdfs://nameservice/druid/segments/dsp_traf_supervisor/20181012T111500.000Z_20181012T113000.000Z/2018-10-12T11_15_00.092Z/12_b4531fa6-ff31-4f77-8ec0-622576357348_index.zip}, interval=2018-10-12T11:15:00.000Z/2018-10-12T11:30:00.000Z, dataSource='dsp_traf_supervisor', binaryVersion='9'}, DataSegment{size=116478, shardSpec=NumberedShardSpec{partitionNum=38, partitions=0}, metrics=[metrics], version='2018-10-12T10:45:00.042Z', loadSpec={type=>hdfs, path=>hdfs://nameservice/druid/segments/dsp_traf_supervisor/20181012T104500.000Z_20181012T110000.000Z/2018-10-12T10_45_00.042Z/38_0206ea77-a14a-4089-a378-d953b7136e46_index.zip}, interval=2018-10-12T10:45:00.000Z/2018-10-12T11:00:00.000Z, dataSource='dsp_traf_supervisor', binaryVersion='9'}, DataSegment{size=109597332, shardSpec=NumberedShardSpec{partitionNum=19, partitions=0}, metrics=[auctions, visitors_unique, bet_count, win_count, impression_count, win_price, price_fl, click_count, sum_bets, auction_quote, marker], dimensions=["dimensions"], version='2018-10-12T11:00:00.076Z', loadSpec={type=>hdfs, path=>hdfs://nameservice/druid/segments/dsp_traf_supervisor/20181012T110000.000Z_20181012T111500.000Z/2018-10-12T11_00_00.076Z/19_7eed516a-5458-4518-8cb4-38c0b5173f88_index.zip}, interval=2018-10-12T11:00:00.000Z/2018-10-12T11:15:00.000Z, dataSource='dsp_traf_supervisor', binaryVersion='9'}]

2018-10-12T11:18:30,095 INFO [publish-0] io.druid.indexing.kafka.KafkaIndexTask - Publishing with isTransaction[true].

2018-10-12T11:18:30,096 INFO [publish-0] io.druid.indexing.common.actions.RemoteTaskActionClient - Performing action for task[index_kafka_dsp_traf_supervisor_6d81720c3fb7dec_cpfkmaoi]: SegmentInsertAction{segments=[DataSegment{size=14241658, shardSpec=NumberedShardSpec{partitionNum=12, partitions=0}, metrics=[metrics], dimensions=["dimensions"], version='2018-10-12T11:15:00.092Z', loadSpec={type=>hdfs, path=>hdfs://nameservice/druid/segments/dsp_traf_supervisor/20181012T111500.000Z_20181012T113000.000Z/2018-10-12T11_15_00.092Z/12_b4531fa6-ff31-4f77-8ec0-622576357348_index.zip}, interval=2018-10-12T11:15:00.000Z/2018-10-12T11:30:00.000Z, dataSource='dsp_traf_supervisor', binaryVersion='9'}, DataSegment{size=116478, shardSpec=NumberedShardSpec{partitionNum=38, partitions=0}, metrics=[metrics], dimensions=["dims"], version='2018-10-12T10:45:00.042Z', loadSpec={type=>hdfs, path=>hdfs://nameservice/druid/segments/dsp_traf_supervisor/20181012T104500.000Z_20181012T110000.000Z/2018-10-12T10_45_00.042Z/38_0206ea77-a14a-4089-a378-d953b7136e46_index.zip}, interval=2018-10-12T10:45:00.000Z/2018-10-12T11:00:00.000Z, dataSource='dsp_traf_supervisor', binaryVersion='9'}, DataSegment{size=109597332, shardSpec=NumberedShardSpec{partitionNum=19, partitions=0}, metrics=[auctions, visitors_unique, bet_count, win_count, impression_count, win_price, price_fl, click_count, sum_bets, auction_quote, marker], dimensions=["dims"], version='2018-10-12T11:00:00.076Z', loadSpec={type=>hdfs, path=>hdfs://nameservice/druid/segments/dsp_traf_supervisor/20181012T110000.000Z_20181012T111500.000Z/2018-10-12T11_00_00.076Z/19_7eed516a-5458-4518-8cb4-38c0b5173f88_index.zip}, interval=2018-10-12T11:00:00.000Z/2018-10-12T11:15:00.000Z, dataSource='dsp_traf_supervisor', binaryVersion='9'}], startMetadata=KafkaDataSourceMetadata{kafkaPartitions=KafkaPartitions{topic='druid_queue', partitionOffsetMap={12=6429749689}}}, endMetadata=KafkaDataSourceMetadata{kafkaPartitions=KafkaPartitions{topic='druid_queue', partitionOffsetMap={12=6430979856}}}}

2018-10-12T11:18:30,108 INFO [publish-0] io.druid.indexing.common.actions.RemoteTaskActionClient - Submitting action for task[index_kafka_dsp_traf_supervisor_6d81720c3fb7dec_cpfkmaoi] to overlord: [SegmentInsertAction{segments=[DataSegment{size=14241658, shardSpec=NumberedShardSpec{partitionNum=12, partitions=0}, metrics=[metrics], dimensions=["dims"], version='2018-10-12T11:15:00.092Z', loadSpec={type=>hdfs, path=>hdfs://nameservice/druid/segments/dsp_traf_supervisor/20181012T111500.000Z_20181012T113000.000Z/2018-10-12T11_15_00.092Z/12_b4531fa6-ff31-4f77-8ec0-622576357348_index.zip}, interval=2018-10-12T11:15:00.000Z/2018-10-12T11:30:00.000Z, dataSource='dsp_traf_supervisor', binaryVersion='9'}, DataSegment{size=116478, shardSpec=NumberedShardSpec{partitionNum=38, partitions=0}, metrics=[metrics], dimensions=["dims"], version='2018-10-12T10:45:00.042Z', loadSpec={type=>hdfs, path=>hdfs://nameservice/druid/segments/dsp_traf_supervisor/20181012T104500.000Z_20181012T110000.000Z/2018-10-12T10_45_00.042Z/38_0206ea77-a14a-4089-a378-d953b7136e46_index.zip}, interval=2018-10-12T10:45:00.000Z/2018-10-12T11:00:00.000Z, dataSource='dsp_traf_supervisor', binaryVersion='9'}, DataSegment{size=109597332, shardSpec=NumberedShardSpec{partitionNum=19, partitions=0}, metrics=[auctions, visitors_unique, bet_count, win_count, impression_count, win_price, price_fl, click_count, sum_bets, auction_quote, marker], dimensions=["dims"], version='2018-10-12T11:00:00.076Z', loadSpec={type=>hdfs, path=>hdfs://nameservice/druid/segments/dsp_traf_supervisor/20181012T110000.000Z_20181012T111500.000Z/2018-10-12T11_00_00.076Z/19_7eed516a-5458-4518-8cb4-38c0b5173f88_index.zip}, interval=2018-10-12T11:00:00.000Z/2018-10-12T11:15:00.000Z, dataSource='dsp_traf_supervisor', binaryVersion='9'}], startMetadata=KafkaDataSourceMetadata{kafkaPartitions=KafkaPartitions{topic='druid_queue', partitionOffsetMap={12=6429749689}}}, endMetadata=KafkaDataSourceMetadata{kafkaPartitions=KafkaPartitions{topic='druid_queue', partitionOffsetMap={12=6430979856}}}}].

2018-10-12T11:18:30,124 INFO [publish-0] io.druid.segment.realtime.appenderator.BaseAppenderatorDriver - Transaction failure while publishing segments, removing them from deep storage and checking if someone else beat us to publishing.

2018-10-12T11:18:30,128 INFO [publish-0] io.druid.storage.hdfs.HdfsDataSegmentKiller - Killing segment[dsp_traf_supervisor_2018-10-12T11:15:00.000Z_2018-10-12T11:30:00.000Z_2018-10-12T11:15:00.092Z_12] mapped to path[hdfs://nameservice/druid/segments/dsp_traf_supervisor/20181012T111500.000Z_20181012T113000.000Z/2018-10-12T11_15_00.092Z/12_b4531fa6-ff31-4f77-8ec0-622576357348_index.zip]

2018-10-12T11:18:30,153 INFO [publish-0] io.druid.storage.hdfs.HdfsDataSegmentKiller - Killing segment[dsp_traf_supervisor_2018-10-12T10:45:00.000Z_2018-10-12T11:00:00.000Z_2018-10-12T10:45:00.042Z_38] mapped to path[hdfs://nameservice/druid/segments/dsp_traf_supervisor/20181012T104500.000Z_20181012T110000.000Z/2018-10-12T10_45_00.042Z/38_0206ea77-a14a-4089-a378-d953b7136e46_index.zip]

2018-10-12T11:18:30,172 INFO [publish-0] io.druid.storage.hdfs.HdfsDataSegmentKiller - Killing segment[dsp_traf_supervisor_2018-10-12T11:00:00.000Z_2018-10-12T11:15:00.000Z_2018-10-12T11:00:00.076Z_19] mapped to path[hdfs://nameservice/druid/segments/dsp_traf_supervisor/20181012T110000.000Z_20181012T111500.000Z/2018-10-12T11_00_00.076Z/19_7eed516a-5458-4518-8cb4-38c0b5173f88_index.zip]

2018-10-12T11:18:30,190 INFO [publish-0] io.druid.indexing.common.actions.RemoteTaskActionClient - Performing action for task[index_kafka_dsp_traf_supervisor_6d81720c3fb7dec_cpfkmaoi]: SegmentListUsedAction{dataSource='dsp_traf_supervisor', intervals=[2018-10-12T10:45:00.000Z/2018-10-12T11:30:00.000Z]}

2018-10-12T11:18:30,192 INFO [publish-0] io.druid.indexing.common.actions.RemoteTaskActionClient - Submitting action for task[index_kafka_dsp_traf_supervisor_6d81720c3fb7dec_cpfkmaoi] to overlord: [SegmentListUsedAction{dataSource='dsp_traf_supervisor', intervals=[2018-10-12T10:45:00.000Z/2018-10-12T11:30:00.000Z]}].

2018-10-12T11:18:30,248 WARN [publish-0] io.druid.segment.realtime.appenderator.BaseAppenderatorDriver - Failed publish, not removing segments: [DataSegment{size=14241658, shardSpec=NumberedShardSpec{partitionNum=12, partitions=0}, metrics=[metrics], dimensions=["dims"], version='2018-10-12T11:15:00.092Z', loadSpec={type=>hdfs, path=>hdfs://nameservice/druid/segments/dsp_traf_supervisor/20181012T111500.000Z_20181012T113000.000Z/2018-10-12T11_15_00.092Z/12_b4531fa6-ff31-4f77-8ec0-622576357348_index.zip}, interval=2018-10-12T11:15:00.000Z/2018-10-12T11:30:00.000Z, dataSource='dsp_traf_supervisor', binaryVersion='9'}, DataSegment{size=116478, shardSpec=NumberedShardSpec{partitionNum=38, partitions=0}, metrics=[metrics], dimensions=["dims"], version='2018-10-12T10:45:00.042Z', loadSpec={type=>hdfs, path=>hdfs://nameservice/druid/segments/dsp_traf_supervisor/20181012T104500.000Z_20181012T110000.000Z/2018-10-12T10_45_00.042Z/38_0206ea77-a14a-4089-a378-d953b7136e46_index.zip}, interval=2018-10-12T10:45:00.000Z/2018-10-12T11:00:00.000Z, dataSource='dsp_traf_supervisor', binaryVersion='9'}, DataSegment{size=109597332, shardSpec=NumberedShardSpec{partitionNum=19, partitions=0}, metrics=[metrics], dimensions=["dims"], version='2018-10-12T11:00:00.076Z', loadSpec={type=>hdfs, path=>hdfs://nameservice/druid/segments/dsp_traf_supervisor/20181012T110000.000Z_20181012T111500.000Z/2018-10-12T11_00_00.076Z/19_7eed516a-5458-4518-8cb4-38c0b5173f88_index.zip}, interval=2018-10-12T11:00:00.000Z/2018-10-12T11:15:00.000Z, dataSource='dsp_traf_supervisor', binaryVersion='9'}]

io.druid.java.util.common.ISE: Failed to publish segments.

at io.druid.segment.realtime.appenderator.BaseAppenderatorDriver.lambda$publishInBackground$8(BaseAppenderatorDriver.java:578) ~[druid-server-0.12.3-iap6.jar:0.12.3-iap6]

at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_171]

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_171]

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_171]

at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]

2018-10-12T11:18:30,252 ERROR [publish-0] io.druid.indexing.kafka.KafkaIndexTask - Error while publishing segments for sequence[SequenceMetadata{sequenceName='index_kafka_dsp_traf_supervisor_6d81720c3fb7dec_0', sequenceId=0, startOffsets={12=6429749689}, endOffsets={12=6430979856}, assignments=, sentinel=false, checkpointed=true}]

io.druid.java.util.common.ISE: Failed to publish segments.

at io.druid.segment.realtime.appenderator.BaseAppenderatorDriver.lambda$publishInBackground$8(BaseAppenderatorDriver.java:578) ~[druid-server-0.12.3-iap6.jar:0.12.3-iap6]

at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_171]

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_171]

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_171]

at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]

2018-10-12T11:18:30,257 INFO [task-runner-0-priority-0] io.druid.segment.realtime.appenderator.AppenderatorImpl - Shutting down immediately...

2018-10-12T11:18:30,260 INFO [task-runner-0-priority-0] io.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing segment[dsp_traf_supervisor_2018-10-12T11:15:00.000Z_2018-10-12T11:30:00.000Z_2018-10-12T11:15:00.092Z_12] at path[/druid_new/segments/imply103:8109/imply103:8109_indexer-executor_hot_tier_2018-10-12T11:01:22.807Z_e93f05822d2649ce916eeafcadbf6e4a0]

2018-10-12T11:18:30,261 INFO [task-runner-0-priority-0] io.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing segment[dsp_traf_supervisor_2018-10-12T10:45:00.000Z_2018-10-12T11:00:00.000Z_2018-10-12T10:45:00.042Z_38] at path[/druid_new/segments/imply103:8109/imply103:8109_indexer-executor_hot_tier_2018-10-12T11:01:22.807Z_e93f05822d2649ce916eeafcadbf6e4a0]

2018-10-12T11:18:30,263 INFO [task-runner-0-priority-0] io.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing segment[dsp_traf_supervisor_2018-10-12T11:00:00.000Z_2018-10-12T11:15:00.000Z_2018-10-12T11:00:00.076Z_19] at path[/druid_new/segments/imply103:8109/imply103:8109_indexer-executor_hot_tier_2018-10-12T11:01:22.807Z_e93f05822d2649ce916eeafcadbf6e4a0]

2018-10-12T11:18:30,263 INFO [task-runner-0-priority-0] io.druid.curator.announcement.Announcer - unannouncing [/druid_new/segments/imply103:8109/imply103:8109_indexer-executor_hot_tier_2018-10-12T11:01:22.807Z_e93f05822d2649ce916eeafcadbf6e4a0]

2018-10-12T11:18:30,279 INFO [task-runner-0-priority-0] io.druid.segment.realtime.firehose.ServiceAnnouncingChatHandlerProvider - Unregistering chat handler[index_kafka_dsp_traf_supervisor_6d81720c3fb7dec_cpfkmaoi]

2018-10-12T11:18:30,279 INFO [task-runner-0-priority-0] io.druid.curator.discovery.CuratorDruidNodeAnnouncer - Unannouncing [DiscoveryDruidNode{druidNode=DruidNode{serviceName='druid/middlemanager', host='imply103', port=-1, plaintextPort=8109, enablePlaintextPort=true, tlsPort=-1, enableTlsPort=false}, nodeType='peon', services={dataNodeService=DataNodeService{tier='hot_tier', maxSize=0, type=indexer-executor, priority=0}, lookupNodeService=LookupNodeService{lookupTier='dont_exist_loookup_tier'}}}].

2018-10-12T11:18:30,279 INFO [task-runner-0-priority-0] io.druid.curator.announcement.Announcer - unannouncing [/druid_new/internal-discovery/peon/imply103:8109]

2018-10-12T11:18:30,283 INFO [task-runner-0-priority-0] io.druid.curator.discovery.CuratorDruidNodeAnnouncer - Unannounced [DiscoveryDruidNode{druidNode=DruidNode{serviceName='druid/middlemanager', host='imply103', port=-1, plaintextPort=8109, enablePlaintextPort=true, tlsPort=-1, enableTlsPort=false}, nodeType='peon', services={dataNodeService=DataNodeService{tier='hot_tier', maxSize=0, type=indexer-executor, priority=0}, lookupNodeService=LookupNodeService{lookupTier='dont_exist_loookup_tier'}}}].

2018-10-12T11:18:30,284 INFO [task-runner-0-priority-0] io.druid.server.coordination.CuratorDataSegmentServerAnnouncer - Unannouncing self[DruidServerMetadata{name='imply103:8109', hostAndPort='imply103:8109', hostAndTlsPort='null', maxSize=0, tier='hot_tier', type=indexer-executor, priority=0}] at [/druid_new/announcements/imply103:8109]

2018-10-12T11:18:30,284 INFO [task-runner-0-priority-0] io.druid.curator.announcement.Announcer - unannouncing [/druid_new/announcements/imply103:8109]

2018-10-12T11:18:30,288 ERROR [task-runner-0-priority-0] io.druid.indexing.kafka.KafkaIndexTask - Encountered exception while running task.

java.util.concurrent.ExecutionException: io.druid.java.util.common.ISE: Failed to publish segments.

at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) ~[guava-16.0.1.jar:?]

at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) ~[guava-16.0.1.jar:?]

at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[guava-16.0.1.jar:?]

at io.druid.indexing.kafka.KafkaIndexTask.runInternal(KafkaIndexTask.java:785) ~[druid-kafka-indexing-service-0.12.3-iap6.jar:0.12.3-iap6]

at io.druid.indexing.kafka.KafkaIndexTask.run(KafkaIndexTask.java:358) [druid-kafka-indexing-service-0.12.3-iap6.jar:0.12.3-iap6]

at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:457) [druid-indexing-service-0.12.3-iap6.jar:0.12.3-iap6]

at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:429) [druid-indexing-service-0.12.3-iap6.jar:0.12.3-iap6]

at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_171]

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_171]

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_171]

at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]

Caused by: io.druid.java.util.common.ISE: Failed to publish segments.

at io.druid.segment.realtime.appenderator.BaseAppenderatorDriver.lambda$publishInBackground$8(BaseAppenderatorDriver.java:578) ~[druid-server-0.12.3-iap6.jar:0.12.3-iap6]

... 4 more

2018-10-12T11:18:30,295 INFO [task-runner-0-priority-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_kafka_dsp_traf_supervisor_6d81720c3fb7dec_cpfkmaoi] status changed to [FAILED].

2018-10-12T11:18:30,298 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_kafka_dsp_traf_supervisor_6d81720c3fb7dec_cpfkmaoi",
  "status" : "FAILED",
  "duration" : 1029268,
  "errorMsg" : "java.util.concurrent.ExecutionException: io.druid.java.util.common.ISE: Failed to publish segments.\n..."
}

```

**And the related message in overlord.log**

```

2018-10-12T11:15:23,434 WARN [qtp612854398-134] io.druid.java.util.common.RetryUtils - Failed on try 5, retrying in 19,245ms.
```

Hey Павел,

These errors are usually transient and retryable (they happen when two threads are trying to allocate a segment at the same time). In fact, it’s totally normal for threads to cross paths from time to time, and Druid actually won’t log this error unless it’s gone through at least a few retries.

If you are seeing this repeatedly, it might be because segment allocation is taking a long time, which increases the probability of conflicts. You can mitigate it by adding the indexes from this PR: https://github.com/apache/incubator-druid/pull/6356. They'll be created by default in 0.13+, and they help quite a bit. It can also help to enable pending segment cleanup (druid.coordinator.kill.pendingSegments.on; see http://druid.io/docs/latest/configuration/index.html#coordinator-operation). That will probably also be on by default in the future.
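
If you need relief before upgrading, that cleanup duty essentially boils down to deleting old rows from the pendingsegments table, which you can also do by hand. A minimal sketch, again assuming the default druid_pendingSegments table name; the cutoff date below is only an example, so pick one safely older than any currently running task (and ideally pause the supervisor first):

```sql
-- Sketch: manual equivalent of the coordinator's pending segment cleanup.
-- Removes stale pending rows for one datasource; the cutoff is an
-- illustrative value, not a recommendation.
DELETE FROM druid_pendingSegments
WHERE dataSource = 'dsp_traf_supervisor'
  AND created_date < '2018-10-11T00:00:00.000Z';
```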

Thanks, Gian. The error in the overlord log is gone, but the indexing tasks still fail as before. We checked the overlord logs and did not find any exception there that would explain the failure while publishing segments.