I’m using Tranquility to stream events into Druid with hourly segment granularity. Out of nowhere, a task has appeared in the “Waiting Tasks - Tasks waiting on locks” section of the Overlord console. It has been sitting there for a few hours now and has blocked every hourly ingestion task since.
Querying the task status at
http://localhost:8090/druid/indexer/v1/task/index_realtime_ds_2018-05-25T06:00:00.000Z_0_0/status
reports the status as RUNNING and the duration as -1.
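For reference, the status check looks roughly like this. The response shown below is abridged and its exact layout is an assumption on my part, but the fields I care about match what I described above:

```
# Ask the Overlord for the stuck task's status
curl http://localhost:8090/druid/indexer/v1/task/index_realtime_ds_2018-05-25T06:00:00.000Z_0_0/status

# Abridged response (assumed layout) -- it never leaves RUNNING and duration stays -1:
# {
#   "task": "index_realtime_ds_2018-05-25T06:00:00.000Z_0_0",
#   "status": {
#     "id": "index_realtime_ds_2018-05-25T06:00:00.000Z_0_0",
#     "status": "RUNNING",
#     "duration": -1
#   }
# }
```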
Issuing a shutdown with

curl -XPOST http://localhost:8090/druid/indexer/v1/task/index_realtime_ds_2018-05-25T06:00:00.000Z_0_0/shutdown

seems to take forever and then returns {"task": "index_realtime_ds_2018-05-25T06:00:00.000Z_0_0"}.
The Overlord console is no help either: the kill button there returns “Kill request failed with status: 0 please check overlord logs.”
I even tried shutting down the firehose directly:

curl -XPOST http://0.0.0.0:8100/druid/worker/v1/chat/firehose:druid:overlord:ds-006-0000-0000/shutdown

This returns connection refused, which suggests the firehose has already shut down.
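To see the stuck task outside the console, I also list the Overlord’s task queues directly. A minimal sketch, assuming the standard Overlord endpoints:

```
# Tasks currently waiting on locks (this is where the stuck task shows up)
curl http://localhost:8090/druid/indexer/v1/waitingTasks

# Running tasks, for comparison
curl http://localhost:8090/druid/indexer/v1/runningTasks
```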
There are no exceptions in the Druid logs. In the Tranquility log, however, I see:

```
c.m.t.server.http.TranquilityServlet - Server error serving request to http://172.30.1.234:8200/v1/post/ds
java.lang.IllegalStateException: Failed to create merged beam: druid:overlord/ds
Caused by: com.twitter.finagle.GlobalRequestTimeoutException: exceeded 1.minutes+30.seconds to disco!druid:overlord while waiting for a response for the request, including retries (if applicable)

ERROR c.m.tranquility.beam.ClusteredBeam - Failed to update cluster state: druid:overlord/ds
com.twitter.finagle.GlobalRequestTimeoutException: exceeded 1.minutes+30.seconds to disco!druid:overlord while waiting for a response for the request, including retries (if applicable)
	at com.twitter.finagle.NoStacktrace(Unknown Source) ~[na:na]
2018-05-25 11:34:33,654 [ClusteredBeam-ZkFuturePool-8176ebe2-2cea-4040-b050-efbb8da56ff0] WARN c.m.tranquility.beam.ClusteredBeam - Emitting alert: [anomaly] Failed to create merged beam: druid:overlord/ds
```
I believe these errors appear every hour, whenever Tranquility fails to create a new realtime ingestion task.
How can I free the task that is waiting on the lock?