[Possible Bug] Overlord stuck retrying "Inserting task index_realtime_*"

In our production Tranquility (0.7.3) + Druid (0.8.3) cluster, we occasionally see a realtime index task fail to become functional for the entire segment length (1 hour). The Overlord log shows infinite retries on "Inserting task", as below. I thought this was related to https://github.com/druid-io/druid/pull/2573, so I did the following:

  1. Upgraded Druid to 0.9.2
  2. Inserted a task with the ID that would be created for the next hour (05:00) (see the sketch after this list)
  3. Waited until 05:00, expecting that 0.9.2 would not get stuck
  4. The infinite insert retries happened again
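
For reference, here is a minimal sketch of how the pre-insert in step 2 can be reproduced, assuming the task is submitted through the Overlord's standard POST /druid/indexer/v1/task endpoint; the spec file, host, and port below are placeholders, not our exact setup.

# Minimal sketch: submit a realtime task whose ID collides with the one
# Tranquility will generate for the next hour. Assumes the standard Overlord
# endpoint POST /druid/indexer/v1/task; OVERLORD and the spec file are placeholders.
import json
import requests

OVERLORD = "http://overlord-host:8090"  # placeholder host/port

with open("realtime_task_spec.json") as f:  # placeholder task spec file
    spec = json.load(f)

# Force the colliding task ID for the upcoming hour (05:00).
spec["id"] = "index_realtime_data_source_2016-12-05T05:00:00.000Z_0_0"

resp = requests.post(
    OVERLORD + "/druid/indexer/v1/task",
    headers={"Content-Type": "application/json"},
    data=json.dumps(spec),
)
print(resp.status_code, resp.text)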

My questions: has anyone seen this problem before? Is https://github.com/druid-io/druid/pull/2573 really related to it? Any mitigation recommendations?

2016-12-05 00:03:39,868 INFO i.d.i.o.MetadataTaskStorage [qtp74735260-45] Inserting task index_realtime_data_source_2016-12-05T00:00:00.000Z_0_0 with status: TaskStatus{id=index_realtime_data_source_2016-12-05T00:00:00.000Z_0_0, status=RUNNING, duration=-1}
2016-12-05 00:03:39,879 INFO i.d.i.o.MetadataTaskStorage [qtp74735260-72] Inserting task index_realtime_data_source_2016-12-05T00:00:00.000Z_0_1 with status: TaskStatus{id=index_realtime_data_source_2016-12-05T00:00:00.000Z_0_1, status=RUNNING, duration=-1}
2016-12-05 00:03:40,321 INFO i.d.i.o.MetadataTaskStorage [qtp74735260-51] Inserting task index_realtime_data_source_2016-12-05T00:00:00.000Z_0_0 with status: TaskStatus{id=index_realtime_data_source_2016-12-05T00:00:00.000Z_0_0, status=RUNNING, duration=-1}
2016-12-05 00:03:40,331 INFO i.d.i.o.MetadataTaskStorage [qtp74735260-55] Inserting task index_realtime_data_source_2016-12-05T00:00:00.000Z_0_1 with status: TaskStatus{id=index_realtime_data_source_2016-12-05T00:00:00.000Z_0_1, status=RUNNING, duration=-1}
2016-12-05 00:03:40,642 INFO i.d.i.o.MetadataTaskStorage [qtp74735260-46] Inserting task index_realtime_data_source_2016-12-05T00:00:00.000Z_0_0 with status: TaskStatus{id=index_realtime_data_source_2016-12-05T00:00:00.000Z_0_0, status=RUNNING, duration=-1}
2016-12-05 00:03:40,653 INFO i.d.i.o.MetadataTaskStorage [qtp74735260-62] Inserting task index_realtime_data_source_2016-12-05T00:00:00.000Z_0_1 with status: TaskStatus{id=index_realtime_data_source_2016-12-05T00:00:00.000Z_0_1, status=RUNNING, duration=-1}
2016-12-05 00:03:40,761 INFO i.d.i.o.MetadataTaskStorage [qtp74735260-61] Inserting task index_realtime_data_source_2016-12-05T00:00:00.000Z_0_0 with status: TaskStatus{id=index_realtime_data_source_2016-12-05T00:00:00.000Z_0_0, status=RUNNING, duration=-1}
2016-12-05 00:03:40,772 INFO i.d.i.o.MetadataTaskStorage [qtp74735260-59] Inserting task index_realtime_data_source_2016-12-05T00:00:00.000Z_0_1 with status: TaskStatus{id=index_realtime_data_source_2016-12-05T00:00:00.000Z_0_1, status=RUNNING, duration=-1}
2016-12-05 00:03:40,782 INFO i.d.i.o.MetadataTaskStorage [qtp74735260-63] Inserting task index_realtime_data_source_2016-12-05T00:00:00.000Z_0_0 with status: TaskStatus{id=index_realtime_data_source_2016-12-05T00:00:00.000Z_0_0, status=RUNNING, duration=-1}
2016-12-05 00:03:40,793 INFO i.d.i.o.MetadataTaskStorage [qtp74735260-65] Inserting task index_realtime_data_source_2016-12-05T00:00:00.000Z_0_1 with status: TaskStatus{id=index_realtime_data_source_2016-12-05T00:00:00.000Z_0_1, status=RUNNING, duration=-1}
2016-12-05 00:03:40,884 INFO i.d.i.o.MetadataTaskStorage [qtp74735260-70] Inserting task index_realtime_data_source_2016-12-05T00:00:00.000Z_0_0 with status: TaskStatus{id=index_realtime_data_source_2016-12-05T00:00:00.000Z_0_0, status=RUNNING, duration=-1}
2016-12-05 00:03:40,895 INFO i.d.i.o.MetadataTaskStorage [qtp74735260-49] Inserting task index_realtime_data_source_2016-12-05T00:00:00.000Z_0_1 with status: TaskStatus{id=index_realtime_data_source_2016-12-05T00:00:00.000Z_0_1, status=RUNNING, duration=-1}
2016-12-05 00:03:40,906 INFO i.d.i.o.MetadataTaskStorage [qtp74735260-47] Inserting task index_realtime_stats_60_2016-12-05T00:00:00.000Z_0_0 with status: TaskStatus{id=index_realtime_stats_60_2016-12-05T00:00:00.000Z_0_0, status=RUNNING, duration=-1}
2016-12-05 00:03:40,920 INFO i.d.i.o.MetadataTaskStorage [qtp74735260-53] Inserting task index_realtime_stats_60_2016-12-05T00:00:00.000Z_0_1 with status: TaskStatus{id=index_realtime_stats_60_2016-12-05T00:00:00.000Z_0_1, status=RUNNING, duration=-1}
2016-12-05 00:03:40,951 INFO i.d.i.o.MetadataTaskStorage [qtp74735260-75] Inserting task index_realtime_data_source_2016-12-05T00:00:00.000Z_0_0 with status: TaskStatus{id=index_realtime_data_source_2016-12-05T00:00:00.000Z_0_0, status=RUNNING, duration=-1}
2016-12-05 00:03:40,962 INFO i.d.i.o.MetadataTaskStorage [qtp74735260-76] Inserting task index_realtime_data_source_2016-12-05T00:00:00.000Z_0_1 with status: TaskStatus{id=index_realtime_data_source_2016-12-05T00:00:00.000Z_0_1, status=RUNNING, duration=-1}
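
While these retries are happening, the task that keeps being "inserted" can be checked directly against the Overlord to see whether it is already known and running. A small sketch, assuming the standard task status endpoint; host and port are placeholders:

# Check whether the Overlord already considers the repeatedly "inserted" task to exist.
# Uses GET /druid/indexer/v1/task/{taskId}/status; host/port are placeholders.
import requests

OVERLORD = "http://overlord-host:8090"
TASK_ID = "index_realtime_data_source_2016-12-05T00:00:00.000Z_0_0"

resp = requests.get("%s/druid/indexer/v1/task/%s/status" % (OVERLORD, TASK_ID))
print(resp.status_code)
print(resp.json() if resp.ok else resp.text)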

On the Tranquility side, it keeps logging "Creating druid indexing task" over and over again, so it's not clear whether the Druid indexing service is retrying or Tranquility is trying to create new tasks over and over. If a task already exists, wouldn't Tranquility just merge beams instead of trying to create new tasks?

To add more detail: the index task was actually created successfully in Druid, but Tranquility thinks otherwise and keeps throwing "Failed to update cluster state" with a 400 Bad Request. Below are an index task log and a Tranquility log, showing that the index task was created and shut down properly, but Tranquility was not able to see it.

Index Task Log:
2016-12-05T06:02:55,765 INFO [task-runner-0] io.druid.segment.realtime.firehose.ServiceAnnouncingChatHandlerProvider - Registering Eventhandler[data_source-06-0000-0001]
2016-12-05T06:02:55,766 INFO [task-runner-0] io.druid.curator.discovery.CuratorServiceAnnouncer - Announcing service[DruidNode{serviceName='data_source-06-0000-0001', host='ebs-stats-druid-indexer-iad5-a-i-7ac70d82.us-east-1.amazon.com', port=11024}]
2016-12-05T06:02:55,767 INFO [task-runner-0] io.druid.server.metrics.EventReceiverFirehoseRegister - Registering EventReceiverFirehoseMetric for service [firehose:druid:prod:overlord:data_source-06-0000-0001]
2016-12-05T06:02:55,768 INFO [task-runner-0] io.druid.data.input.FirehoseFactory - Firehose created, will shut down at: 2016-12-05T07:25:00.000Z
2016-12-05T07:20:00,002 INFO [data_source-overseer-0] io.druid.segment.realtime.plumber.RealtimePlumber - Starting merge and push.
2016-12-05T07:20:00,002 INFO [data_source-overseer-0] io.druid.segment.realtime.plumber.RealtimePlumber - Found [0] segments. Attempting to hand off segments that start before [1970-01-01T00:00:00.000Z].
2016-12-05T07:20:00,002 INFO [data_source-overseer-0] io.druid.segment.realtime.plumber.RealtimePlumber - Found [0] sinks to persist and merge
2016-12-05T07:25:00,000 INFO [timed-shutoff-firehose-0] io.druid.data.input.FirehoseFactory - Closing delegate firehose.
2016-12-05T07:25:00,000 INFO [timed-shutoff-firehose-0] io.druid.segment.realtime.firehose.EventReceiverFirehoseFactory - Firehose closing.
2016-12-05T07:25:00,001 INFO [timed-shutoff-firehose-0] io.druid.server.metrics.EventReceiverFirehoseRegister - Unregistering EventReceiverFirehoseMetric for service [firehose:druid:prod:overlord:data_source-06-0000-0001]
2016-12-05T07:25:00,001 INFO [timed-shutoff-firehose-0] io.druid.segment.realtime.firehose.ServiceAnnouncingChatHandlerProvider - Unregistering chat handler[firehose:druid:prod:overlord:data_source-06-0000-0001]
2016-12-05T07:25:00,001 INFO [timed-shutoff-firehose-0] io.druid.curator.discovery.CuratorServiceAnnouncer - Unannouncing service[DruidNode{serviceName='firehose:druid:prod:overlord:data_source-06-0000-0001', host='ebs-stats-druid-indexer-iad5-a-i-7ac70d82.us-east-1.amazon.com', port=11024}]
2016-12-05T07:25:00,133 INFO [task-runner-0] io.druid.indexing.common.task.RealtimeIndexTask - Persisting remaining data.
2016-12-05T07:25:00,135 INFO [task-runner-0] io.druid.segment.realtime.plumber.RealtimePlumber - Submitting persist runnable for dataSource[data_source]
2016-12-05T07:25:00,138 INFO [task-runner-0] io.druid.indexing.common.task.RealtimeIndexTask - Finishing the job.
2016-12-05T07:25:00,138 INFO [task-runner-0] io.druid.segment.realtime.plumber.RealtimePlumber - Shutting down...
2016-12-05T07:25:00,138 INFO [task-runner-0] io.druid.indexing.common.task.RealtimeIndexTask - Job done!
2016-12-05T07:25:00,144 INFO [task-runner-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
"id" : "index_realtime_data_source_2016-12-05T06:00:00.000Z_0_1",
"status" : "SUCCESS",
"duration" : 4924475
}

Tranquility Log:
05 Dec 2016 07:01:17,176 [ERROR] (ClusteredBeam-ZkFuturePool-e1027048-dc04-41ee-9028-e33fd93a170b) com.metamx.tranquility.beam.ClusteredBeam: Failed to update cluster state: druid:prod:overlord/data_source
com.metamx.tranquility.druid.IndexServicePermanentException: Service[druid:prod:overlord] call failed with status: 400 Bad Request
at com.metamx.tranquility.druid.IndexService$$anonfun$call$1$$anonfun$apply$7.apply(IndexService.scala:135)
at com.metamx.tranquility.druid.IndexService$$anonfun$call$1$$anonfun$apply$7.apply(IndexService.scala:110)
at com.twitter.util.Future$$anonfun$map$1$$anonfun$apply$6.apply(Future.scala:950)
at com.twitter.util.Try$.apply(Try.scala:13)
at com.twitter.util.Future$.apply(Future.scala:97)
at com.twitter.util.Future$$anonfun$map$1.apply(Future.scala:950)
at com.twitter.util.Future$$anonfun$map$1.apply(Future.scala:949)
at com.twitter.util.Promise$Transformer.liftedTree1$1(Promise.scala:112)
at com.twitter.util.Promise$Transformer.k(Promise.scala:112)
at com.twitter.util.Promise$Transformer.apply(Promise.scala:122)
at com.twitter.util.Promise$Transformer.apply(Promise.scala:103)
at com.twitter.util.Promise$$anon$1.run(Promise.scala:366)

It looks like it took more than 1.minutes+30.seconds for the Overlord to respond to Tranquility. The slowness was between inserting the task into MySQL and Druid adding the task to its active tasks. It seems to be a transient issue that left Druid + Tranquility in an inconsistent state.

I'm trying to find where I can raise this timeout from 1.minutes+30.seconds to something higher. Also, should the Overlord return a proper exception to Tranquility saying the task is already active and running, so that Tranquility knows to create a new beam, instead of just throwing a 400 Bad Request?

05 Dec 2016 16:02:45,559 [INFO] (ClusteredBeam-ZkFuturePool-b8fc5727-e4c7-4c03-95c0-b778b556f249) com.metamx.emitter.core.LoggingEmitter: Event [{"feed":"alerts","timestamp":"2016-12-05T16:02:45.559Z","service":"tranquility","host":"localhost","severity":"anomaly","description":"Failed to create merged beam: druid:prod:overlord/data_source","data":{"exceptionType":"java.lang.IllegalStateException","exceptionMessage":"Failed to save new beam for identifier[druid:prod:overlord/data_source] timestamp[2016-12-05T16:00:00.000Z]","exceptionStackTrace":"java.lang.IllegalStateException: Failed to save new beam for identifier[druid:prod:overlord/data_source] timestamp[2016-12-05T16:00:00.000Z]\n\tat com.metamx.tranquility.beam.ClusteredBeam$$anonfun$2.applyOrElse(ClusteredBeam.scala:286)\n\tat com.metamx.tranquility.beam.ClusteredBeam$$anonfun$2.applyOrElse(ClusteredBeam.scala:283)\n\tat com.twitter.util.Future$$anonfun$rescue$1.apply(Future.scala:924)\n\tat com.twitter.util.Future$$anonfun$rescue$1.apply(Future.scala:922)\n\tat com.twitter.util.Promise$Transformer.liftedTree1$1(Promise.scala:112)\n\tat com.twitter.util.Promise$Transformer.k(Promise.scala:112)\n\tat com.twitter.util.Promise$Transformer.apply(Promise.scala:122)\n\tat com.twitter.util.Promise$Transformer.apply(Promise.scala:103)\n\tat com.twitter.util.Promise$$anon$1.run(Promise.scala:366)\n\tat com.twitter.concurrent.LocalScheduler$Activation.run(Scheduler.scala:178)\n\tat com.twitter.concurrent.LocalScheduler$Activation.submit(Scheduler.scala:136)\n\tat com.twitter.concurrent.LocalScheduler.submit(Scheduler.scala:207)\n\tat com.twitter.concurrent.Scheduler$.submit(Scheduler.scala:92)\n\tat com.twitter.util.Promise.runq(Promise.scala:350)\n\tat com.twitter.util.Promise.updateIfEmpty(Promise.scala:721)\n\tat com.twitter.util.ExecutorServiceFuturePool$$anon$2.run(FuturePool.scala:107)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)\n\tat java.lang.Thread.run(Thread.java:745)\nCaused by: com.twitter.finagle.GlobalRequestTimeoutException: exceeded 1.minutes+30.seconds to druid:prod:overlord while waiting for a response for the request, including retries (if applicable)\n\tat com.twitter.finagle.NoStacktrace(Unknown Source)\n"}}]
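
In the meantime, when Tranquility reports these failures, Druid's side of the story can be double-checked by listing the tasks the Overlord itself believes are running, and a stuck task can be shut down through the Overlord if it really needs to go away. A rough sketch (host and port are placeholders):

# List the tasks the Overlord believes are running, and optionally shut one down.
# Uses GET /druid/indexer/v1/runningTasks and POST /druid/indexer/v1/task/{id}/shutdown;
# host/port are placeholders.
import requests

OVERLORD = "http://overlord-host:8090"

running = requests.get(OVERLORD + "/druid/indexer/v1/runningTasks").json()
for task in running:
    print(task.get("id"))

# Last resort: shut a stuck task down so the next segment granularity starts clean.
# task_id = "index_realtime_data_source_2016-12-05T06:00:00.000Z_0_1"
# requests.post("%s/druid/indexer/v1/task/%s/shutdown" % (OVERLORD, task_id))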