Segment merge tasks are not scheduled by the coordinator

Hello all,

We have several Druid deployments in production (0.9.1.1), all with identical configuration and setup.

Our datasources are refreshed every 8 hours by a Hadoop batch indexing job. We use the coordinator’s druid.coordinator.merge.on flag to merge small segments.

In one deployment the merge tasks are simply not scheduled by the coordinator, and so we’re running into serious stability and performance issues.

Other deployments are working just fine.

**Coordinator configuration:**

```
druid.coordinator.startDelay=PT60s
druid.coordinator.merge.on=true
druid.coordinator.period.indexingPeriod=PT1800S
druid.coordinator.kill.on=true
druid.coordinator.kill.maxSegments=100
druid.coordinator.kill.durationToRetain=P7D
```

**Errors:**

I’m seeing the errors below repeatedly in the coordinator log (not sure if they’re relevant).

I don’t see any indication that the segment merger ever kicks in (no “Issued merge requests for %s segments” messages).
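For what it’s worth, one way to confirm that no merge tasks were ever submitted is to query the overlord’s task table in the metadata store directly. This is just a sketch: the default druid_tasks table name and the “merge” task-id prefix are both assumptions on my part, so adjust for your setup.

```sql
-- Sketch: list the most recent merge tasks the overlord has received.
-- Assumes the default druid_tasks table name and that coordinator-issued
-- merge tasks have ids prefixed with the "merge" task type. created_date
-- is an ISO8601 string, so string ordering is chronological.
SELECT id, created_date, active
FROM druid_tasks
WHERE id LIKE 'merge%'
ORDER BY created_date DESC
LIMIT 20;
```

If the coordinator never issues the merge requests, I’d expect this to come back empty.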

```
2017-11-13T00:00:13,744 - ERROR [Coordinator-Exec--0:Logger@97] - Caught exception, ignoring so that schedule keeps going.: {class=io.druid.server.coordinator.DruidCoordinator, exceptionType=class java.lang.UnsupportedOperationException, exceptionMessage=Cannot add overlapping segments [2017-06-24T12:00:00.000Z/2017-06-24T18:00:00.000Z and 2017-06-24T14:30:00.000Z/2017-06-24T14:45:00.000Z] with the same version [2017-06-24T12:00:04.263Z]}
java.lang.UnsupportedOperationException: Cannot add overlapping segments [2017-06-24T12:00:00.000Z/2017-06-24T18:00:00.000Z and 2017-06-24T14:30:00.000Z/2017-06-24T14:45:00.000Z] with the same version [2017-06-24T12:00:04.263Z]
    at io.druid.timeline.VersionedIntervalTimeline.addAtKey(VersionedIntervalTimeline.java:358) ~[druid-common-0.9.1.1.jar:0.9.1.1]
    at io.druid.timeline.VersionedIntervalTimeline.add(VersionedIntervalTimeline.java:279) ~[druid-common-0.9.1.1.jar:0.9.1.1]
    at io.druid.timeline.VersionedIntervalTimeline.add(VersionedIntervalTimeline.java:109) ~[druid-common-0.9.1.1.jar:0.9.1.1]
    at io.druid.server.coordinator.helper.DruidCoordinatorCleanupOvershadowed.run(DruidCoordinatorCleanupOvershadowed.java:71) ~[druid-server-0.9.1.1.jar:0.9.1.1]
    at io.druid.server.coordinator.DruidCoordinator$CoordinatorRunnable.run(DruidCoordinator.java:703) [druid-server-0.9.1.1.jar:0.9.1.1]
    at io.druid.server.coordinator.DruidCoordinator$5.call(DruidCoordinator.java:585) [druid-server-0.9.1.1.jar:0.9.1.1]
    at io.druid.server.coordinator.DruidCoordinator$5.call(DruidCoordinator.java:578) [druid-server-0.9.1.1.jar:0.9.1.1]
    at com.metamx.common.concurrent.ScheduledExecutors$2.run(ScheduledExecutors.java:99) [java-util-0.27.9.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_65]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_65]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_65]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_65]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_65]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_65]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_65]
```

```
2017-11-13T00:05:43,164 - ERROR [Master-PeonExec--0:Logger@97] - Server[/druid/loadQueue/10.1.120.121:8080], throwable caught when submitting [SegmentChangeRequestLoad{segment=DataSegment{size=17174, shardSpec=NoneShardSpec, metrics=[…], dimensions=[…], version='2017-01-25T18:33:36.397Z', loadSpec={type=hdfs, path=/apps/druid/data/my_datasource/20170106T000000.000Z_20170125T000000.000Z/2017-01-25T18_33_36.397Z/0/index.zip}, interval=2017-01-06T00:00:00.000Z/2017-01-25T00:00:00.000Z, dataSource='my_datasource', binaryVersion='9'}}].
com.metamx.common.ISE: /druid/loadQueue/10.1.120.121:8080/my_datasource_2017-01-06T00:00:00.000Z_2017-01-25T00:00:00.000Z_2017-01-25T18:33:36.397Z was never removed! Failing this operation!
    at io.druid.server.coordinator.LoadQueuePeon$1$1.run(LoadQueuePeon.java:234) [druid-server-0.9.1.1.jar:0.9.1.1]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_65]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_65]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_65]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_65]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_65]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_65]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_65]
```

Any help will be much appreciated!

Thanks,

Amir

Hi Amir,

The first message indicates that your cluster got into a crazy state somehow. This message:

Cannot add overlapping segments [2017-06-24T12:00:00.000Z/2017-06-24T18:00:00.000Z and 2017-06-24T14:30:00.000Z/2017-06-24T14:45:00.000Z] with the same version [2017-06-24T12:00:04.263Z]

suggests that you have two segments with the same version but with overlapping, non-identical intervals. That isn’t supposed to happen, and it may indicate that something is set up wrong in your ingestion. I think the simplest way to fix it is to figure out which of the two conflicting segments is the one you really want, and then mark the other one unused in the metadata store (set used = false). It would also be good to double-check that your ingestion setup is sane. If it is, then, if possible, it’d be worth tracking down when those two conflicting segments were created, since that might indicate a bug.
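If it helps, here’s a minimal sketch of that metadata fix, assuming MySQL and the default druid_segments table name (adjust if druid.metadata.storage.tables.base is configured differently); the segment id in the UPDATE is a placeholder:

```sql
-- Sketch: find pairs of *used* segments that share a version but have
-- overlapping, non-identical intervals (the exact condition the coordinator
-- is complaining about). start/end are stored as fixed-length ISO8601 UTC
-- strings, so plain string comparison orders them chronologically.
SELECT a.id AS segment_a, b.id AS segment_b
FROM druid_segments a
JOIN druid_segments b
  ON  a.dataSource = b.dataSource
  AND a.version    = b.version
  AND a.id < b.id                                     -- list each pair once
WHERE a.used = true
  AND b.used = true
  AND a.start < b.`end` AND b.start < a.`end`         -- intervals overlap...
  AND NOT (a.start = b.start AND a.`end` = b.`end`);  -- ...but are not identical

-- Then mark the segment you do NOT want as unused so the coordinator
-- ignores it ('<unwanted_segment_id>' is a placeholder, not a real id):
UPDATE druid_segments SET used = false WHERE id = '<unwanted_segment_id>';
```

After that, the coordinator’s next cycle should get past the overshadowed-segment cleanup and reach the merge step.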

Thanks Gian, we will definitely look into the overlapping segments issue. But just to make sure we’re dealing with the main problem here: can this explain why the merge tasks are not scheduled?

Amir

Hey Amir,

I think it would explain the issue you’re having. The exception in that code path would mean that the coordinator run stops partway through, and doesn’t finish doing all of its tasks. I think running merges is one of the later tasks that would get skipped due to this exception.

Hey Gian,

We marked the overlapping segments as unused, and now the merge tasks are running normally.

Thank you very much for helping out with this!

Cheers,

Amir

Hi Gian,

I’m currently using 0.11.0 and noticing the same issue: when there is an overlapping segment with the same version, the coordinator craps out in its scan. So an error from the ingestion side results in no new segments getting recognized by the coordinator. Is there a reason this model was chosen over emitting a metric about the faulty overlapping segment and continuing to scan the rest of the segments?

Thanks,

Sharanya