Indexing task killed after 2 seconds

I started a fresh Druid cluster and submitted a kafka supervisor spec. The first indexing task appears to have succeeded, but the second was killed while it was still starting up - maybe a race condition? Is there a setting for how long the overlord waits for an indexing task to come up? In the overlord log I see:

2016-07-21T18:28:21,328 INFO [KafkaSupervisor-business_events_test-0] com.metamx.http.client.pool.ChannelResourceFactory - Generating: http://druid-data001.internal.shopkick.com:8100

2016-07-21T18:28:21,355 INFO [KafkaSupervisor-business_events_test-0] io.druid.indexing.kafka.supervisor.KafkaSupervisor - Creating new task group [0] for partitions [0, 1, 2, 3, 4, 5, 6, 7]

2016-07-21T18:28:21,355 INFO [KafkaSupervisor-business_events_test-0] io.druid.indexing.kafka.supervisor.KafkaSupervisor - Number of tasks [0] does not match configured numReplicas [1] in task group [0], creating more tasks

2016-07-21T18:28:21,358 INFO [KafkaSupervisor-business_events_test-0] io.druid.indexing.overlord.MetadataTaskStorage - Inserting task index_kafka_business_events_test_d799231bf4224a8_bbgjchle with status: TaskStatus{id=index_kafka_business_events_test_d799231bf4224a8_bbgjchle, status=RUNNING, duration=-1}

2016-07-21T18:28:21,390 INFO [KafkaSupervisor-business_events_test-0] io.druid.indexing.overlord.TaskLockbox - Adding task[index_kafka_business_events_test_d799231bf4224a8_bbgjchle] to activeTasks

2016-07-21T18:28:21,390 INFO [KafkaSupervisor-business_events_test-0] io.druid.indexing.kafka.supervisor.KafkaSupervisor - {id=‘business_events_test’, generationTime=2016-07-21T18:28:21.390Z, payload={dataSource=‘business_events_test’, topic=‘raw_shopkick_pylons_weblog_avro_v1’, partitions=8, replicas=1, durationSeconds=3600, active=, publishing=[{id=‘index_kafka_business_events_test_b7b30f41cef94ef_imejjjaf’, startTime=2016-07-21T17:27:31.988Z, remainingSeconds=1799}]}}

2016-07-21T18:28:21,390 INFO [TaskQueue-Manager] io.druid.indexing.overlord.TaskQueue - Asking taskRunner to run: index_kafka_business_events_test_d799231bf4224a8_bbgjchle

2016-07-21T18:28:21,391 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Added pending task index_kafka_business_events_test_d799231bf4224a8_bbgjchle

2016-07-21T18:28:21,392 WARN [rtr-pending-tasks-runner-0] io.druid.indexing.overlord.RemoteTaskRunner - No worker selections strategy set. Using default.

2016-07-21T18:28:21,392 INFO [rtr-pending-tasks-runner-0] io.druid.indexing.overlord.RemoteTaskRunner - Coordinator asking Worker[druid-data001.internal.shopkick.com:8091] to add task[index_kafka_business_events_test_d799231bf4224a8_bbgjchle]

2016-07-21T18:28:21,406 INFO [KafkaSupervisor-business_events_test-0] io.druid.indexing.kafka.supervisor.KafkaSupervisor - {id=‘business_events_test’, generationTime=2016-07-21T18:28:21.406Z, payload={dataSource=‘business_events_test’, topic=‘raw_shopkick_pylons_weblog_avro_v1’, partitions=8, replicas=1, durationSeconds=3600, active=[{id=‘index_kafka_business_events_test_d799231bf4224a8_bbgjchle’, startTime=null, remainingSeconds=null}], publishing=[{id=‘index_kafka_business_events_test_b7b30f41cef94ef_imejjjaf’, startTime=2016-07-21T17:27:31.988Z, remainingSeconds=1799}]}}

2016-07-21T18:28:21,427 INFO [rtr-pending-tasks-runner-0] io.druid.indexing.overlord.RemoteTaskRunner - Task index_kafka_business_events_test_d799231bf4224a8_bbgjchle switched from pending to running (on [druid-data001.internal.shopkick.com:8091])2016-07-21T18:28:21,509 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[druid-data001.internal.shopkick.com:8091] wrote

RUNNING status for task [index_kafka_business_events_test_d799231bf4224a8_bbgjchle] on [TaskLocation{host=‘druid-data001.internal.shopkick.com’, port=8102}]

2016-07-21T18:28:21,509 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_kafka_business_events_test_d799231bf4224a8_bbgjchle] location changed to [TaskLocation{host=‘druid-data001.internal.shopkick.com’, port=8102}].

2016-07-21T18:28:21,427 INFO [rtr-pending-tasks-runner-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_kafka_business_events_test_d799231bf4224a8_bbgjchle] status changed to [RUNNING].

2016-07-21T18:28:21,479 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[druid-data001.internal.shopkick.com:8091] wrote RUNNING status for task [index_kafka_business_events_test_d799231bf4224a8_bbgjchle] on [TaskLocation{host=‘null’, port=-1}]

2016-07-21T18:28:24,640 WARN [KafkaSupervisor-business_events_test-0] io.druid.indexing.kafka.supervisor.KafkaSupervisor - Task [index_kafka_business_events_test_d799231bf4224a8_bbgjchle] failed to return start time, killing task

java.lang.RuntimeException: java.net.ConnectException: Connection refused

    at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.1.jar:?]

    at io.druid.indexing.kafka.KafkaIndexTaskClient.submitRequest(KafkaIndexTaskClient.java:328) ~[druid-kafka-indexing-service-0.9.1.jar:0.9.1]

    at io.druid.indexing.kafka.KafkaIndexTaskClient.submitRequest(KafkaIndexTaskClient.java:233) ~[druid-kafka-indexing-service-0.9.1.jar:0.9.1]

    at io.druid.indexing.kafka.KafkaIndexTaskClient.getStartTime(KafkaIndexTaskClient.java:156) ~[druid-kafka-indexing-service-0.9.1.jar:0.9.1]

    at io.druid.indexing.kafka.supervisor.KafkaSupervisor.getTaskStartTime(KafkaSupervisor.java:1218) ~[druid-kafka-indexing-service-0.9.1.jar:0.9.1]

    at io.druid.indexing.kafka.supervisor.KafkaSupervisor.updateTaskStatus(KafkaSupervisor.java:708) [druid-kafka-indexing-service-0.9.1.jar:0.9.1]

    at io.druid.indexing.kafka.supervisor.KafkaSupervisor.runInternal(KafkaSupervisor.java:457) [druid-kafka-indexing-service-0.9.1.jar:0.9.1]

    at io.druid.indexing.kafka.supervisor.KafkaSupervisor$RunNotice.handle(KafkaSupervisor.java:409) [druid-kafka-indexing-service-0.9.1.jar:0.9.1]

    at io.druid.indexing.kafka.supervisor.KafkaSupervisor$2.run(KafkaSupervisor.java:261) [druid-kafka-indexing-service-0.9.1.jar:0.9.1]

    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_91]

    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_91]

    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_91]

    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_91]

    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_91]

Caused by: java.net.ConnectException: Connection refused

    at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_91]

    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_91]

    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_91]

    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_91]

    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_91]

    at java.net.Socket.connect(Socket.java:589) ~[?:1.8.0_91]

    at java.net.Socket.connect(Socket.java:538) ~[?:1.8.0_91]

    at java.net.Socket.<init>(Socket.java:434) ~[?:1.8.0_91]

    at java.net.Socket.<init>(Socket.java:211) ~[?:1.8.0_91]

    at io.druid.indexing.kafka.KafkaIndexTaskClient.checkConnection(KafkaIndexTaskClient.java:228) ~[?:?]

    at io.druid.indexing.kafka.KafkaIndexTaskClient.submitRequest(KafkaIndexTaskClient.java:264) ~[?:?]

    ... 12 more

2016-07-21T18:28:24,649 INFO [KafkaSupervisor-business_events_test-0] com.metamx.http.client.pool.ChannelResourceFactory - Generating: http://druid-data001.internal.shopkick.com:8091

2016-07-21T18:28:24,682 INFO [KafkaSupervisor-business_events_test-0] io.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to worker: druid-data001.internal.shopkick.com:8091, status 200 OK, response: {“task”:“index_kafka_business_events_test_d799231bf4224a8_bbgjchle”}

2016-07-21T18:28:24,682 ERROR [KafkaSupervisor-business_events_test-0] io.druid.indexing.overlord.RemoteTaskRunner - Shutdown failed for index_kafka_business_events_test_d799231bf4224a8_bbgjchle! Are you sure the task was running?

The last line in the indexing task log was:

2016-07-21T18:28:24,053 INFO [main] io.druid.initialization.Initialization - Adding local file system extension module [io.druid.indexing.kafka.KafkaIndexTaskModule] for class [io.druid.initialization.DruidModule]

My overlord config is just:
druid.service=druid/overlord

druid.port=8090

druid.indexer.queue.startDelay=PT30S

druid.indexer.storage.recentlyFinishedThreshold=PT24H

druid.indexer.runner.type=remote

druid.indexer.storage.type=metadata

Thanks for the help,

–Ben

Hey Ben,

This looks like a bug to me. The timeout for the task to respond isn’t currently configurable, but it should have been something like 60 seconds. The bug is that the call to getStartTime doesn’t retry until timeout but fails after the first attempt, hence if the HTTP endpoint isn’t responsive before the supervisor pokes it, the supervisor will kill the task.

Do you happen to have an aggressive ‘period’ defined in your supervisor spec (by aggressive I mean < 5 seconds or so). If so, increasing it may help, although the race condition is still there since the task created event will also trigger a supervisor task scan.

The good news is that if tasks are sometimes starting up okay and sometimes failing, there aren’t really any adverse effects of the failed task. If however none of your tasks will start before being killed, you may have to patch the extension with a fix and rebuild it.

I’ve filed an issue for it here with a comment about the small change that I think should fix it: https://github.com/druid-io/druid/issues/3276

Let me know if the issue continues to happen at an unacceptable level without a fix and we can figure out something that works for you.

Thanks David. It isn’t causing problems and doesn’t happen very frequently. I can wait for the fix to filter through to a release.