Infinite segment load and gaps in hourly data

I’m looking at the coordinator node, and the number of segments drops, then increases, and keeps hovering around the same number. When I look at the box I see the following exception about a particular job never being removed:

com.metamx.common.ISE: /druid/v1/prod/loadQueue/…/druid_job_2016-03-04T15:00:00.000Z_2016-03-04T16:00:00.000Z_2016-03-04T17:23:32.386Z was never removed! Failing this operation!
    at io.druid.server.coordinator.LoadQueuePeon$1$1.run(LoadQueuePeon.java:236) [druid-server-0.8.3.jar:0.8.3]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [?:1.7.0_85]
    at java.util.concurrent.FutureTask.run(FutureTask.java:262) [?:1.7.0_85]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178) [?:1.7.0_85]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292) [?:1.7.0_85]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_85]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_85]
    at java.lang.Thread.run(Thread.java:745) [?:1.7.0_85]

I suspect this is also what causes the gaps in our hourly data, since those segments never seem to get loaded.
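In case it helps with reproducing this, here is a minimal sketch of how the stuck segments can be spotted from the coordinator’s HTTP API, using the /druid/coordinator/v1/loadqueue and /druid/coordinator/v1/loadstatus endpoints (coordinator:8081 is just a placeholder for the coordinator host and port, not our actual setup):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Rough sketch: dump the coordinator's load queue and overall load status.
    // "coordinator:8081" is a placeholder for the coordinator host:port.
    public class LoadQueueCheck
    {
      public static void main(String[] args) throws Exception
      {
        // Per-historical counts/sizes of segments waiting to be loaded or dropped.
        dump("http://coordinator:8081/druid/coordinator/v1/loadqueue?simple");
        // Fraction of used segments that are actually loaded, per datasource.
        dump("http://coordinator:8081/druid/coordinator/v1/loadstatus");
      }

      private static void dump(String url) throws Exception
      {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
          String line;
          while ((line = in.readLine()) != null) {
            System.out.println(line);
          }
        } finally {
          conn.disconnect();
        }
      }
    }

If the queued counts for a historical never shrink while loadstatus stays under 100%, that would line up with segments never actually loading.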

Rerunning the jobs manually also seems to be hit or miss: sometimes they succeed and sometimes they fail. Any idea what could be happening here?

The issue ended up resolving itself after I restarted the historical and coordinator nodes.

Which version of Druid?

We’re using 0.8.3.

I suspect it got into a weird loop where the segments weren’t being loaded properly and kept getting re-queued. The restart probably cleared that up, since everything has gone smoothly since then.
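For anyone else who hits this: my reading of the ISE above is that the coordinator gives up on a segment load it handed to a historical once its load timeout expires, and then re-queues the segment on a later run. The coordinator’s load timeout would be the knob to look at; a sketch of the property as I understand it (PT15M should be the documented default, but verify against your version’s docs):

    # coordinator runtime.properties
    # How long the coordinator waits for a historical to finish an assigned
    # segment load/drop before failing the request (and later re-queueing it).
    druid.coordinator.load.timeout=PT15M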

Logs from the historicals would be useful for figuring out what happened.

Sorry, but I ended up purging them.