Overlord complains about duplicate segment keys in Druid 0.10.1

Hi, we’re in the process of upgrading from Druid 0.9.0 to Druid 0.10.1 and we see exceptions when real-time peons finish executing and publish their segments. We do not see these exceptions with Druid 0.9.0. This problem seems to only occur when real-time tasks are replicated – if we scale back the number of task replicas to 1, the problem goes away. Here is a log trace from the overlord, filtered to lines having to do with the segment in question:

2017-12-07T04:15:02,505 INFO [qtp1625539182-203] io.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_1]: SegmentInsertAction{segments=[DataSegment{size=23250, shardSpec=LinearShardSpec{partitionNum=0}, metrics=[sum_pkts_in, sum_pkts_out, sum_bytes_in, sum_bytes_out, cnt], dimensions=[enterprise_id, cd_serial, app_name, app_type, data_tag, network_name, network_obj_id, network_rev_id, remote_cd_serial, remote_ip, remote_ip_hex], version=‘2017-12-07T03:00:13.292Z’, loadSpec={type=hdfs, path=hdfs://nameservice1/druid/network_remote_stats/20171207T030000.000Z_20171207T040000.000Z/2017-12-07T03_00_13.292Z/0_index.zip}, interval=2017-12-07T03:00:00.000Z/2017-12-07T04:00:00.000Z, dataSource=‘network_remote_stats’, binaryVersion=‘9’}]}

2017-12-07T04:15:02,506 INFO [qtp1625539182-203] io.druid.indexing.overlord.MetadataTaskStorage - Logging action for task[index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_1]: SegmentInsertAction{segments=[DataSegment{size=23250, shardSpec=LinearShardSpec{partitionNum=0}, metrics=[sum_pkts_in, sum_pkts_out, sum_bytes_in, sum_bytes_out, cnt], dimensions=[enterprise_id, cd_serial, app_name, app_type, data_tag, network_name, network_obj_id, network_rev_id, remote_cd_serial, remote_ip, remote_ip_hex], version=‘2017-12-07T03:00:13.292Z’, loadSpec={type=hdfs, path=hdfs://nameservice1/druid/network_remote_stats/20171207T030000.000Z_20171207T040000.000Z/2017-12-07T03_00_13.292Z/0_index.zip}, interval=2017-12-07T03:00:00.000Z/2017-12-07T04:00:00.000Z, dataSource=‘network_remote_stats’, binaryVersion=‘9’}]}

2017-12-07T04:15:02,552 INFO [qtp1625539182-234] io.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_0]: SegmentInsertAction{segments=[DataSegment{size=23250, shardSpec=LinearShardSpec{partitionNum=0}, metrics=[sum_pkts_in, sum_pkts_out, sum_bytes_in, sum_bytes_out, cnt], dimensions=[enterprise_id, cd_serial, app_name, app_type, data_tag, network_name, network_obj_id, network_rev_id, remote_cd_serial, remote_ip, remote_ip_hex], version=‘2017-12-07T03:00:13.292Z’, loadSpec={type=hdfs, path=hdfs://nameservice1/druid/network_remote_stats/20171207T030000.000Z_20171207T040000.000Z/2017-12-07T03_00_13.292Z/0_index.zip}, interval=2017-12-07T03:00:00.000Z/2017-12-07T04:00:00.000Z, dataSource=‘network_remote_stats’, binaryVersion=‘9’}]}

2017-12-07T04:15:02,552 INFO [qtp1625539182-234] io.druid.indexing.overlord.MetadataTaskStorage - Logging action for task[index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_0]: SegmentInsertAction{segments=[DataSegment{size=23250, shardSpec=LinearShardSpec{partitionNum=0}, metrics=[sum_pkts_in, sum_pkts_out, sum_bytes_in, sum_bytes_out, cnt], dimensions=[enterprise_id, cd_serial, app_name, app_type, data_tag, network_name, network_obj_id, network_rev_id, remote_cd_serial, remote_ip, remote_ip_hex], version=‘2017-12-07T03:00:13.292Z’, loadSpec={type=hdfs, path=hdfs://nameservice1/druid/network_remote_stats/20171207T030000.000Z_20171207T040000.000Z/2017-12-07T03_00_13.292Z/0_index.zip}, interval=2017-12-07T03:00:00.000Z/2017-12-07T04:00:00.000Z, dataSource=‘network_remote_stats’, binaryVersion=‘9’}]}

2017-12-07T04:15:02,828 INFO [qtp1625539182-203] io.druid.metadata.IndexerSQLMetadataStorageCoordinator - Published segment [network_remote_stats_2017-12-07T03:00:00.000Z_2017-12-07T04:00:00.000Z_2017-12-07T03:00:13.292Z] to DB

2017-12-07T04:15:03,183 ERROR [qtp1625539182-234] io.druid.metadata.IndexerSQLMetadataStorageCoordinator - Exception inserting segment [network_remote_stats_2017-12-07T03:00:00.000Z_2017-12-07T04:00:00.000Z_2017-12-07T03:00:13.292Z] into DB

org.skife.jdbi.v2.exceptions.UnableToExecuteStatementException: com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException: Duplicate entry ‘network_remote_stats_2017-12-07T03:00:00.000Z_2017-12-07T04:00:0’ for key ‘PRIMARY’ [statement:“INSERT INTO druid_segments (id, dataSource, created_date, start, end, partitioned, version, used, payload) VALUES (:id, :dataSource, :created_date, :start, :end, :partitioned, :version, :used, :payload)”, located:“INSERT INTO druid_segments (id, dataSource, created_date, start, end, partitioned, version, used, payload) VALUES (:id, :dataSource, :created_date, :start, :end, :partitioned, :version, :used, :payload)”, rewritten:“INSERT INTO druid_segments (id, dataSource, created_date, start, end, partitioned, version, used, payload) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)”, arguments:{ positional:{}, named:{payload:[123, 34, 100, 97, 116, 97, 83, 111, 117, 114, 99, 101, 34, 58, 34, 110, 101, 116, 119, 111, 114, 107, 95, 114, 101, 109, 111, 116, 101, 95, 115, 116, 97, 116, 115, 34, 44, 34, 105, 110, 116, 101, 114, 118, 97, 108, 34, 58, 34, 50, 48, 49, 55, 45, 49, 50, 45, 48, 55, 84, 48, 51, 58, 48, 48, 58, 48, 48, 46, 48, 48, 48, 90, 47, 50, 48, 49, 55, 45, 49, 50, 45, 48, 55, 84, 48, 52, 58, 48, 48, 58, 48, 48, 46, 48, 48, 48, 90, 34, 44, 34, 118, 101, 114, 115, 105, 111, 110, 34, 58, 34, 50, 48, 49, 55, 45, 49, 50, 45, 48, 55, 84, 48, 51, 58, 48, 48, 58, 49, 51, 46, 50, 57, 50, 90, 34, 44, 34, 108, 111, 97, 100, 83, 112, 101, 99, 34, 58, 123, 34, 116, 121, 112, 101, 34, 58, 34, 104, 100, 102, 115, 34, 44, 34, 112, 97, 116, 104, 34, 58, 34, 104, 100, 102, 115, 58, 47, 47, 110, 97, 109, 101, 115, 101, 114, 118, 105, 99, 101, 49, 47, 100, 114, 117, 105, 100, 47, 110, 101, 116, 119, 111, 114, 107, 95, 114, 101, 109, 111, 116, 101, 95, 115, 116, 97, 116, 115, 47, 50, 48, 49, 55, 49, 50, 48, 55, 84, 48, 51, 48, 48, 48, 48, 46, 48, 48, 48, 90, 95, 50, 48, 49, 55, 49, 50, 48, 55, 84, 48, 52, 48, 48, 48, 48, 46, 48, 48, 48, 90, 47, 50, 48, 49, 55, 45, 49, 50, 45, 48, 55, 84, 48, 51, 95, 48, 48, 95, 49, 51, 46, 50, 57, 50, 90, 47, 48, 95, 105, 110, 100, 101, 120, 46, 122, 105, 112, 34, 125, 44, 34, 100, 105, 109, 101, 110, 115, 105, 111, 110, 115, 34, 58, 34, 101, 110, 116, 101, 114, 112, 114, 105, 115, 101, 95, 105, 100, 44, 99, 100, 95, 115, 101, 114, 105, 97, 108, 44, 97, 112, 112, 95, 110, 97, 109, 101, 44, 97, 112, 112, 95, 116, 121, 112, 101, 44, 100, 97, 116, 97, 95, 116, 97, 103, 44, 110, 101, 116, 119, 111, 114, 107, 95, 110, 97, 109, 101, 44, 110, 101, 116, 119, 111, 114, 107, 95, 111, 98, 106, 95, 105, 100, 44, 110, 101, 116, 119, 111, 114, 107, 95, 114, 101, 118, 95, 105, 100, 44, 114, 101, 109, 111, 116, 101, 95, 99, 100, 95, 115, 101, 114, 105, 97, 108, 44, 114, 101, 109, 111, 116, 101, 95, 105, 112, 44, 114, 101, 109, 111, 116, 101, 95, 105, 112, 95, 104, 101, 120, 34, 44, 34, 109, 101, 116, 114, 105, 99, 115, 34, 58, 34, 115, 117, 109, 95, 112, 107, 116, 115, 95, 105, 110, 44, 115, 117, 109, 95, 112, 107, 116, 115, 95, 111, 117, 116, 44, 115, 117, 109, 95, 98, 121, 116, 101, 115, 95, 105, 110, 44, 115, 117, 109, 95, 98, 121, 116, 101, 115, 95, 111, 117, 116, 44, 99, 110, 116, 34, 44, 34, 115, 104, 97, 114, 100, 83, 112, 101, 99, 34, 58, 123, 34, 116, 121, 112, 101, 34, 58, 34, 108, 105, 110, 101, 97, 114, 34, 44, 34, 112, 97, 114, 116, 105, 116, 105, 111, 110, 78, 117, 109, 34, 58, 48, 125, 44, 34, 98, 105, 110, 97, 114, 121, 86, 101, 114, 115, 105, 111, 110, 34, 58, 57, 44, 34, 115, 105, 122, 101, 34, 58, 50, 51, 50, 53, 48, 44, 34, 105, 100, 101, 110, 116, 105, 102, 105, 101, 114, 34, 58, 34, 110, 101, 116, 119, 111, 114, 107, 95, 114, 101, 109, 111, 116, 101, 95, 115, 116, 97, 116, 115, 95, 50, 48, 49, 55, 45, 49, 50, 45, 48, 55, 84, 48, 51, 58, 48, 48, 58, 48, 48, 46, 48, 48, 48, 90, 95, 50, 48, 49, 55, 45, 49, 50, 45, 48, 55, 84, 48, 52, 58, 48, 48, 58, 48, 48, 46, 48, 48, 48, 90, 95, 50, 48, 49, 55, 45, 49, 50, 45, 48, 55, 84, 48, 51, 58, 48, 48, 58, 49, 51, 46, 50, 57, 50, 90, 34, 125],start:‘2017-12-07T03:00:00.000Z’,partitioned:true,end:‘2017-12-07T04:00:00.000Z’,id:‘network_remote_stats_2017-12-07T03:00:00.000Z_2017-12-07T04:00:00.000Z_2017-12-07T03:00:13.292Z’,created_date:‘2017-12-07T04:15:02.922Z’,used:true,dataSource:‘network_remote_stats’,version:‘2017-12-07T03:00:13.292Z’}, finder:}]

Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException: Duplicate entry ‘network_remote_stats_2017-12-07T03:00:00.000Z_2017-12-07T04:00:0’ for key ‘PRIMARY’

2017-12-07T04:15:04,063 INFO [qtp1625539182-234] io.druid.metadata.IndexerSQLMetadataStorageCoordinator - Found [network_remote_stats_2017-12-07T03:00:00.000Z_2017-12-07T04:00:00.000Z_2017-12-07T03:00:13.292Z] in DB, not updating DB

2017-12-07T04:17:20,160 INFO [qtp1625539182-225] io.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_0]: LockReleaseAction{interval=2017-12-07T03:00:00.000Z/2017-12-07T04:00:00.000Z}

2017-12-07T04:17:20,160 INFO [qtp1625539182-225] io.druid.indexing.overlord.TaskLockbox - Removing task[index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_0] from TaskLock[index_realtime_network_remote_stats]

2017-12-07T04:17:21,344 INFO [qtp1625539182-214] io.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_1]: LockReleaseAction{interval=2017-12-07T03:00:00.000Z/2017-12-07T04:00:00.000Z}

2017-12-07T04:17:21,344 INFO [qtp1625539182-214] io.druid.indexing.overlord.TaskLockbox - Removing task[index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_1] from TaskLock[index_realtime_network_remote_stats]

2017-12-07T04:17:21,571 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.RemoteTaskRunner - Worker[host1:8080] wrote SUCCESS status for task [index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_0] on [TaskLocation{host=‘host1’, port=8106}]

2017-12-07T04:17:21,571 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.RemoteTaskRunner - Worker[host1:8080] completed task[index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_0] with status[SUCCESS]

2017-12-07T04:17:21,571 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskQueue - Received SUCCESS status for task: index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_0

2017-12-07T04:17:21,571 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.RemoteTaskRunner - Cleaning up task[index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_0] on worker[host1:8080]

2017-12-07T04:17:21,582 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskLockbox - Removing task[index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_0] from activeTasks

2017-12-07T04:17:21,583 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.MetadataTaskStorage - Updating task index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_0 to status: TaskStatus{id=index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_0, status=SUCCESS, duration=4927786}

2017-12-07T04:17:21,710 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskQueue - Task done: RealtimeIndexTask{id=index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_0, type=index_realtime, dataSource=network_remote_stats}

2017-12-07T04:17:21,710 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskQueue - Task SUCCESS: RealtimeIndexTask{id=index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_0, type=index_realtime, dataSource=network_remote_stats} (4927786 run duration)

2017-12-07T04:17:21,710 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_0] status changed to [SUCCESS].

2017-12-07T04:17:21,710 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.RemoteTaskRunner - Task[index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_0] went bye bye.

2017-12-07T04:17:22,929 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.RemoteTaskRunner - Worker[host2:8080] wrote SUCCESS status for task [index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_1] on [TaskLocation{host=‘host2’, port=8104}]

2017-12-07T04:17:22,929 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.RemoteTaskRunner - Worker[host2:8080] completed task[index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_1] with status[SUCCESS]

2017-12-07T04:17:22,929 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskQueue - Received SUCCESS status for task: index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_1

2017-12-07T04:17:22,930 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.RemoteTaskRunner - Cleaning up task[index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_1] on worker[host2:8080]

2017-12-07T04:17:22,937 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskLockbox - Removing task[index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_1] from activeTasks

2017-12-07T04:17:22,939 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.MetadataTaskStorage - Updating task index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_1 to status: TaskStatus{id=index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_1, status=SUCCESS, duration=4929152}

2017-12-07T04:17:23,039 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskQueue - Task done: RealtimeIndexTask{id=index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_1, type=index_realtime, dataSource=network_remote_stats}

2017-12-07T04:17:23,039 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskQueue - Task SUCCESS: RealtimeIndexTask{id=index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_1, type=index_realtime, dataSource=network_remote_stats} (4929152 run duration)

2017-12-07T04:17:23,039 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_1] status changed to [SUCCESS].

2017-12-07T04:17:23,039 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.RemoteTaskRunner - Task[index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_1] went bye bye.

From the logs it seems pretty clear that the index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_1 task wins a race to insert the network_remote_stats_2017-12-07T03:00:00.000Z_2017-12-07T04:00:00.000Z_2017-12-07T03:00:13.292Z segment into the database (MySQL), and then index_realtime_network_remote_stats_2017-12-07T03:00:00.000Z_0_0 throws an exception when it sees the segment already exists.

Is anyone else seeing this problem? The cluster appears to be healthy otherwise. Thanks!

–T