Having trouble stopping/shutting down Kafka indexing supervisor jobs

Hi, I'm fairly new to Druid and am primarily using the Kafka indexing service. I'm having trouble shutting down running Kafka indexing data sources.

curl -XGET http://IP:PORT/druid/indexer/v1/supervisor

returns an empty list, as it should, since I’ve run this:

curl -XPOST http://IP:PORT/druid/indexer/v1/supervisor/DATASOURCE/shutdown

on each dataSource in the kafka indexing specs.
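The loop I ran over the data sources looks roughly like this sketch (IP:PORT is a placeholder for the overlord address, the ID extraction is a crude stand-in for jq, and the live calls are commented out):

```shell
#!/bin/sh
# Rough sketch of the shutdown loop; IP:PORT is a placeholder for the overlord.
OVERLORD="http://IP:PORT"

# The supervisor list endpoint returns a JSON array of IDs, e.g. ["ds1","ds2"].
# Crude extraction of the IDs without jq:
supervisor_ids() {
  tr -d '[]" ' | tr ',' '\n'
}

# curl -s "$OVERLORD/druid/indexer/v1/supervisor" | supervisor_ids \
#   | while read -r ds; do
#       curl -XPOST "$OVERLORD/druid/indexer/v1/supervisor/$ds/shutdown"
#     done
```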

However, for days now I have still seen activity in the logs (examples below). Also, oddly, some data sources are still indexing incoming data while others are not (viewing the data in Pivot).

I’ve restarted all nodes and no change.

I’ve also tried killing running tasks I see in the coordinator console using

curl -X POST -H 'Content-Type: application/json' -d @kill-task.json http://HOST:PORT/druid/indexer/v1/task

and the response is

{"error":"Task[index_kafka_DATASOURCE_54d9e987be475d5_kdejoddf] already exists!"}

which makes no sense to me

What’s the trick to managing what the cluster is doing? Have I overloaded it and it’s just really slowly catching up?

Thanks for any help!

Task that is actively indexing data, even though it does not show up in the GET /supervisor call:

2016-10-29T13:23:30,550 INFO [qtp26418585-82] io.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_kafka_DATASOURCE2_54d9e987be475d5_kdejoddf]: SegmentAllocateAction{dataSource='DATASOURCE2', timestamp=2016-10-29T13:23:30.501Z, queryGranularity=NoneGranularity, preferredSegmentGranularity=MINUTE, sequenceName='index_kafka_DATASOURCE2_54d9e987be475d5_1', previousSegmentId='DATASOURCE2_2016-10-29T13:21:00.000Z_2016-10-29T13:22:00.000Z_2016-10-29T13:21:30.194Z_3'}

Task that is NOT actively indexing data:

2016-10-29T13:11:33,468 INFO [qtp1044965465-43] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-10-29T13:11:33.468Z","service":"druid/historical","host":"DRUIDHOST:9083","metric":"query/bytes","value":870,"context":"{"finalize":false,"queryId":"ceabeae2-922a-4d07-9233-edb7c6f1da56","timeout":40000}","dataSource":"DATASOURCE","duration":"PT180120S","hasFilters":"false","id":"ceabeae2-922a-4d07-9233-edb7c6f1da56","interval":["2016-10-13T00:49:00.000Z/2016-10-15T01:15:00.000Z","2016-10-15T01:52:00.000Z/2016-10-15T02:45:00.000Z","2016-10-15T09:42:00.000Z/2016-10-15T09:44:00.000Z","2016-10-15T14:08:00.000Z/2016-10-15T14:49:00.000Z"],"remoteAddress":"xxxx","type":"segmentMetadata","version":"0.9.1.1"}]

2016-10-29T13:11:33,407 INFO [HttpClient-Netty-Worker-3] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-10-29T13:11:33.407Z","service":"druid/broker","host":"DRUIDHOST:9082","metric":"query/node/ttfb","value":8695,"dataSource":"DATASOURCE","duration":"PT180120S","hasFilters":"false","id":"ceabeae2-922a-4d07-9233-edb7c6f1da56","interval":["2016-10-13T00:49:00.000Z/2016-10-15T01:15:00.000Z","2016-10-15T01:52:00.000Z/2016-10-15T02:45:00.000Z","2016-10-15T09:42:00.000Z/2016-10-15T09:44:00.000Z","2016-10-15T14:08:00.000Z/2016-10-15T14:49:00.000Z"],"server":"DRUIDHOST:9083","type":"segmentMetadata","version":"0.9.1.1"}]

Hey Scott,

When you shut down the supervisor, it attempts to stop the indexing tasks it was managing, but if the tasks don't respond before the timeout, the supervisor will just exit and the tasks may remain. There have been some improvements in 0.9.2 that should make the KafkaIndexTask more responsive to commands from the supervisor even when the task is in a bad state.

Regarding killing tasks, I believe you're using the wrong command. What you're submitting is a KillTask, which is used to remove the metadata for a segment and remove it from deep storage, and you're giving your kill task the ID of an existing task, which is why it gets rejected. The command you're actually looking for is:

POST http://<OVERLORD_IP>:<port>/druid/indexer/v1/task/{taskId}/shutdown

as described here: http://druid.io/docs/0.9.1.1/design/indexing-service.html
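Concretely, something like this sketch (OVERLORD_IP, the port, and the task ID are placeholders to substitute with your overlord address and the task ID from the coordinator console; the live call is commented out):

```shell
#!/bin/sh
# Sketch: shut down one indexing task via the overlord's task shutdown endpoint.
# OVERLORD and TASK_ID are placeholders.
OVERLORD="http://OVERLORD_IP:8090"
TASK_ID="index_kafka_DATASOURCE_54d9e987be475d5_kdejoddf"

shutdown_url() {
  # Build the per-task shutdown URL described in the indexing-service docs.
  printf '%s/druid/indexer/v1/task/%s/shutdown' "$1" "$2"
}

# curl -XPOST "$(shutdown_url "$OVERLORD" "$TASK_ID")"
```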

If you post the full task log of one of the tasks that isn't exiting when the supervisor is shut down, I can check to see if there are any clues in there as to why it's not shutting down.

Hey David,

I'm using Kafka 1.1.0 and the problem still persists. When I submit a supervisor spec to the server, it creates a supervisor, which then creates an indexing task. Shutting down the supervisor doesn't kill the indexing task immediately; I need to kill the tasks manually.

Hey Suhas,

Stopping the supervisor does not kill the indexing task immediately. Under normal circumstances, when the supervisor is stopped, it will attempt to signal the indexing tasks to stop reading and begin publishing. You can see if this is happening by searching for 'Pausing ingestion until resumed' in the task logs and reading the log messages following that. This publishing process can take minutes or even a few hours depending on the quantity/complexity of data read.
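For instance, something like this sketch to check a task log for the pause marker (OVERLORD, TASK_ID, and the task.log filename are placeholders; the log download uses the overlord's task log endpoint and is commented out):

```shell
#!/bin/sh
# Sketch: check whether a stopped task actually received the pause/publish signal.
# OVERLORD and TASK_ID are placeholders.
OVERLORD="http://OVERLORD_IP:8090"
TASK_ID="index_kafka_DATASOURCE_54d9e987be475d5_kdejoddf"

find_pause() {
  # Print the pause marker plus the 20 log lines that follow it.
  grep -A 20 'Pausing ingestion until resumed' "$1"
}

# Download the task log from the overlord, then inspect it:
# curl -s "$OVERLORD/druid/indexer/v1/task/$TASK_ID/log" > task.log
# find_pause task.log
```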

If the supervisor is not able to signal to the indexing tasks to stop reading, then the supervisor will exit without touching the indexing tasks (i.e. it will not force kill them). If this is what’s happening, check your overlord logs to see why it was unable to send a request to the task.

Hey David,

This is happening very sporadically, and I'm unable to reproduce the issue. Under normal circumstances, shutting down the supervisor finishes the remaining task and marks it as "Success." But on some days, shutting down the supervisor doesn't finish the task at all, and it runs forever. The overlord log shows that "index_kafka_id" paused successfully, and then it hangs there forever.

I notice the same thing. We kill the supervisor tasks and the subtasks do not die. It's very frustrating, because the subtask IDs look like index_kafka_thing_f94ddb4adea3e44_dlepgcee, and it takes extra scripting to shut down the subtasks.
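The extra scripting amounts to something like this sketch (OVERLORD and DATASOURCE are placeholders, the grep is a crude stand-in for jq, and the live calls against the overlord's runningTasks and task shutdown endpoints are commented out):

```shell
#!/bin/sh
# Sketch: shut down every running index_kafka_<datasource>_* subtask.
# OVERLORD and DATASOURCE are placeholders.
OVERLORD="http://OVERLORD_IP:8090"
DATASOURCE="thing"

matching_task_ids() {
  # Pull index_kafka_<datasource>_... task IDs out of the runningTasks JSON.
  grep -o "\"index_kafka_${1}_[^\"]*\"" | tr -d '"' | sort -u
}

# curl -s "$OVERLORD/druid/indexer/v1/runningTasks" \
#   | matching_task_ids "$DATASOURCE" \
#   | while read -r id; do
#       curl -XPOST "$OVERLORD/druid/indexer/v1/task/$id/shutdown"
#     done
```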