KIS tasks stuck in reboot loop in production cluster!

This is on my production cluster, and it happened out of the blue!
Druid 0.12.1

No real exception on either the MiddleManager or the overlord+coordinator node.

I even tried deleting the /tmp/persist directory and restarting the node.

Some logs from the coordinator:

2018-08-13 14:23:56,238 INFO i.d.s.c.h.DruidCoordinatorLogger [Coordinator-Exec--0] [_default_tier] : Dropped 0 segments among 2 servers

2018-08-13 14:23:56,238 INFO i.d.s.c.h.DruidCoordinatorLogger [Coordinator-Exec--0] [_default_tier] : Moved 0 segment(s)

2018-08-13 14:23:56,238 INFO i.d.s.c.h.DruidCoordinatorLogger [Coordinator-Exec--0] [_default_tier] : Let alone 0 segment(s)

2018-08-13 14:23:56,238 INFO i.d.s.c.h.DruidCoordinatorLogger [Coordinator-Exec--0] Load Queues:

2018-08-13 14:23:56,238 INFO i.d.s.c.h.DruidCoordinatorLogger [Coordinator-Exec--0] Server[10.240.0.3:8080, historical, _default_tier] has 0 left to load, 0 left to drop, 0 bytes queued, 220,974,951,573 bytes served.

2018-08-13 14:23:56,238 INFO i.d.s.c.h.DruidCoordinatorLogger [Coordinator-Exec--0] Server[10.240.0.2:8080, historical, _default_tier] has 0 left to load, 0 left to drop, 0 bytes queued, 220,974,951,573 bytes served.

2018-08-13 14:23:59,727 INFO i.d.j.u.h.c.p.ChannelResourceFactory [KafkaIndexTaskClient-apm-minute-0] Generating: http://10.240.0.5:8100

2018-08-13 14:24:03,678 INFO i.d.i.o.TaskQueue [TaskQueue-StorageSync] Synced 16 tasks from storage (0 tasks added, 0 tasks removed).

2018-08-13 14:24:03,688 INFO i.d.s.l.c.LookupCoordinatorManager [LookupCoordinatorManager--2] Not updating lookups because no data exists

2018-08-13 14:24:05,551 INFO i.d.i.k.KafkaIndexTaskClient [KafkaIndexTaskClient-txntrace-minute-0] No TaskLocation available for task [index_kafka_txntrace-minute_9625e02f31440c8_ckkenpdh], this task may not have been assigned to a worker yet or may have already completed

2018-08-13 14:24:05,558 INFO i.d.i.k.s.KafkaSupervisor [KafkaSupervisor-txntrace-minute] {id='txntrace-minute', generationTime=2018-08-13T14:24:05.558Z, payload={dataSource='txntrace-minute', topic='txntraces', partitions=1, replicas=2, durationSeconds=43200, active=[{id='index_kafka_txntrace-minute_9625e02f31440c8_bappcadh', startTime=2018-08-13T05:55:14.677Z, remainingSeconds=12669}, {id='index_kafka_txntrace-minute_9625e02f31440c8_ckkenpdh', startTime=null, remainingSeconds=null}], publishing=}}

2018-08-13 14:24:05,956 INFO i.d.j.u.h.c.p.ChannelResourceFactory [KafkaIndexTaskClient-infraintegs-minute-1] Generating: http://10.240.0.5:8105

2018-08-13 14:24:05,964 INFO i.d.i.k.KafkaIndexTaskClient [KafkaIndexTaskClient-infraintegs-minute-1] submitRequest failed for [http://10.240.0.4:8107/druid/worker/v1/chat/index_kafka_infraintegs-minute_61bdf910bba8b6a_ekhhghff/offsets/current], with message [Connection refused (Connection refused)]

2018-08-13 14:24:08,956 INFO i.d.m.SQLMetadataRuleManager [DatabaseRuleManager-Exec--0] Polled and found rules for 12 datasource(s)

2018-08-13 14:24:15,568 INFO i.d.i.k.KafkaIndexTaskClient [KafkaIndexTaskClient-infraserver-minute-1] submitRequest failed for [http://10.240.0.4:8106/druid/worker/v1/chat/index_kafka_infraserver-minute_bd2e8ad6464e44a_cdfpcmgj/offsets/current], with message [Connection refused (Connection refused)]

2018-08-13 14:24:15,574 INFO i.d.j.u.h.c.p.ChannelResourceFactory [KafkaIndexTaskClient-infraserver-minute-1] Generating: http://10.240.0.5:8104

2018-08-13 14:24:20,545 INFO i.d.i.k.KafkaIndexTaskClient [KafkaIndexTaskClient-txntrace-minute-0] No TaskLocation available for task [index_kafka_txntrace-minute_9625e02f31440c8_ckkenpdh], this task may not have been assigned to a worker yet or may have already completed

2018-08-13 14:24:20,546 INFO i.d.j.u.h.c.p.ChannelResourceFactory [KafkaIndexTaskClient-txntrace-minute-1] Generating: http://10.240.0.5:8101

2018-08-13 14:24:22,336 INFO i.d.i.k.KafkaIndexTaskClient [KafkaIndexTaskClient-sqltrace-minute-0] submitRequest failed for [http://10.240.0.4:8102/druid/worker/v1/chat/index_kafka_sqltrace-minute_ceadf276f5e05e5_pkllidjf/offsets/current], with message [Connection refused (Connection refused)]

2018-08-13 14:24:22,340 INFO i.d.j.u.h.c.p.ChannelResourceFactory [KafkaIndexTaskClient-sqltrace-minute-0] Generating: http://10.240.0.5:8103

2018-08-13 14:24:22,889 INFO i.d.j.u.h.c.p.ChannelResourceFactory [KafkaIndexTaskClient-pctile-hour-0] Generating: http://10.240.0.5:8102

2018-08-13 14:24:22,895 INFO i.d.i.k.KafkaIndexTaskClient [KafkaIndexTaskClient-pctile-hour-0] submitRequest failed for [http://10.240.0.4:8104/druid/worker/v1/chat/index_kafka_pctile-hour_cc3ace5c8734d83_dclhcief/offsets/current], with message [Connection refused (Connection refused)]

2018-08-13 14:24:22,927 INFO i.d.i.k.KafkaIndexTaskClient [KafkaIndexTaskClient-exceptiontrace-minute-1] submitRequest failed for [http://10.240.0.4:8103/druid/worker/v1/chat/index_kafka_exceptiontrace-minute_93f2744d957e07f_hcebkepd/offsets/current], with message [Connection refused (Connection refused)]

2018-08-13 14:24:22,930 INFO i.d.j.u.h.c.p.ChannelResourceFactory [KafkaIndexTaskClient-exceptiontrace-minute-1] Generating: http://10.240.0.5:8107

2018-08-13 14:24:23,037 INFO i.d.j.u.h.c.p.ChannelResourceFactory [KafkaIndexTaskClient-pctile-minute-1] Generating: http://10.240.0.5:8106

2018-08-13 14:24:23,045 INFO i.d.i.k.KafkaIndexTaskClient [KafkaIndexTaskClient-pctile-minute-1] submitRequest failed for [http://10.240.0.4:8101/druid/worker/v1/chat/index_kafka_pctile-minute_115d3f45427a920_lkcefbkg/offsets/current], with message [Connection refused (Connection refused)]
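
To see at a glance which worker the failed chat requests are going to, a small script along these lines could tally them per host. This is only a rough sketch: the regex is written against the "submitRequest failed" lines pasted above, and the names `FAILED_REQUEST` and `refused_by_host` are made up for illustration, not anything from Druid.

```python
import re
from collections import Counter

# Matches the "submitRequest failed" lines logged by KafkaIndexTaskClient and
# captures the worker host, task port, and task ID from the chat URL.
# Assumption: the line format is the one shown in the log excerpt above.
FAILED_REQUEST = re.compile(
    r"submitRequest failed for \[http://(?P<host>[^:/\]]+):(?P<port>\d+)"
    r"/druid/worker/v1/chat/(?P<task>[^/\]]+)/"
)


def refused_by_host(lines):
    """Count failed task chat requests per worker host."""
    counts = Counter()
    for line in lines:
        match = FAILED_REQUEST.search(line)
        if match:
            counts[match.group("host")] += 1
    return counts
```

Passing the lines of the log file to `refused_by_host` returns a `Counter` keyed by host. In the excerpt above, every failed request targets 10.240.0.4, while the "Generating:" lines all point at 10.240.0.5, which may help narrow down where to look.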

I upgraded to 0.12.2 and still see the same issue.
Please help!

I do see this in the MiddleManager logs:

2018-08-13 15:23:38,862 INFO i.d.i.o.ForkingTaskRunner [forking-task-runner-4] Exception caught during execution
java.io.IOException: Stream closed
    at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) ~[?:1.8.0_181]
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:291) ~[?:1.8.0_181]
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[?:1.8.0_181]
    at java.io.FilterInputStream.read(FilterInputStream.java:107) ~[?:1.8.0_181]
    at com.google.common.io.ByteStreams.copy(ByteStreams.java:175) ~[guava-16.0.1.jar:?]
    at io.druid.indexing.overlord.ForkingTaskRunner$1.call(ForkingTaskRunner.java:452) [druid-indexing-service-0.12.2.jar:0.12.2]
    at io.druid.indexing.overlord.ForkingTaskRunner$1.call(ForkingTaskRunner.java:224) [druid-indexing-service-0.12.2.jar:0.12.2]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_181]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]

Hi Prashant,

I wrote in on the issue you raised (https://github.com/apache/incubator-druid/issues/6166) to ask some questions.