Indexing problem and task impossible to kill after failure

Hi,

I’m stuck on an indexing issue on a specific platform based on an in-house Docker image, mostly inspired by the single-server configurations.

Please note that in most circumstances tasks are indexed correctly, but in some as-yet-unidentified cases the task gets blocked and cannot be killed after failure.

The task log reports that the coordinator cannot be reached:

2020-03-23T16:20:46,672 WARN [task-runner-0-priority-0] org.apache.druid.discovery.DruidLeaderClient - Request[http://localhost:8081/druid/indexer/v1/action] failed.
org.jboss.netty.handler.timeout.ReadTimeoutException
at org.jboss.netty.handler.timeout.ReadTimeoutHandler.(ReadTimeoutHandler.java:84) ~[netty-3.10.6.Final.jar:?]
at org.apache.druid.java.util.http.client.NettyHttpClient.go(NettyHttpClient.java:172) ~[druid-core-0.16.1-incubating.jar:0.16.1-incubating]
at org.apache.druid.java.util.http.client.AbstractHttpClient.go(AbstractHttpClient.java:33) ~[druid-core-0.16.1-incubating.jar:0.16.1-incubating]
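For what it is worth, one workaround I am considering is raising the read timeout of the internal HTTP client. The property below comes from the “Global HTTP Client” section of the Druid configuration reference; the PT30M value is only a guess on my part, since I have not confirmed this client is the one timing out here:

druid.global.http.readTimeout=PT30M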

A firewall issue is not the cause: all components run inside the same Docker container, and the URL is valid and responds to a status query:

curl http://localhost:8081/status
{"version":"0.16.1-incubating","modules":[{"name":"org.apache.druid.common.aws.AWSModule","artifact":"druid-aws-common","version":"0.16.1-incubating"},{"name":"org.apache.druid.common.gcp.GcpModule","artifact":"druid-gcp-common","version":"0.16.1-incubating"},{"name":"org.apache.druid.metadata.storage.postgresql.PostgreSQLMetadataStorageModule","artifact":"postgresql-metadata-storage","version":"0.16.1-incubating"},{"name":"org.apache.druid.query.aggregation.datasketches.theta.SketchModule","artifact":"druid-datasketches","version":"0.16.1-incubating"},{"name":"org.apache.druid.query.aggregation.datasketches.theta.oldapi.OldApiSketchModule","artifact":"druid-datasketches","version":"0.16.1-incubating"},{"name":"org.apache.druid.query.aggregation.datasketches.quantiles.DoublesSketchModule","artifact":"druid-datasketches","version":"0.16.1-incubating"},{"name":"org.apache.druid.query.aggregation.datasketches.tuple.ArrayOfDoublesSketchModule","artifact":"druid-datasketches","version":"0.16.1-incubating"},{"name":"org.apache.druid.query.aggregation.datasketches.hll.HllSketchModule","artifact":"druid-datasketches","version":"0.16.1-incubating"},{"name":"org.apache.druid.indexing.kafka.KafkaIndexTaskModule","artifact":"druid-kafka-indexing-service","version":"0.16.1-incubating"},{"name":"org.apache.druid.server.lookup.namespace.NamespaceExtractionModule","artifact":"druid-lookups-cached-global","version":"0.16.1-incubating"},{"name":"org.apache.druid.query.aggregation.histogram.ApproximateHistogramDruidModule","artifact":"druid-histogram","version":"0.16.1-incubating"},{"name":"org.apache.druid.query.aggregation.TimestampMinMaxModule","artifact":"druid-processing","version":"0.16.1-incubating"}],"memory":{"maxMemory":268435456,"totalMemory":268435456,"freeMemory":52371728,"usedMemory":216063728,"directMemory":268435456}}
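For a check closer to the failing endpoint, the overlord task API itself can be queried the same way (endpoint taken from the Druid API reference):

curl http://localhost:8081/druid/indexer/v1/tasks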

The error in the task log is preceded by a large number of lines such as:

2020-03-23T15:32:43,579 INFO [main] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Attempting to lock file[var/druid/task/index_parallel_160_2020-03-23T15:32:10.645Z/lock].
2020-03-23T15:32:43,581 INFO [main] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Acquired lock file[var/druid/task/index_parallel_160_2020-03-23T15:32:10.645Z/lock] in 2ms.
2020-03-23T15:32:43,584 INFO [main] org.apache.druid.indexing.common.task.AbstractBatchIndexTask - [forceTimeChunkLock] is set to true in task context. Use timeChunk lock
2020-03-23T15:32:43,603 INFO [main] org.apache.druid.indexing.common.actions.RemoteTaskActionClient - Performing action for task[index_parallel_160_2020-03-23T15:32:10.645Z]: TimeChunkLockTryAcquireAction{, type=EXCLUSIVE, interval=2006-01-13T00:00:00.000Z/2006-01-14T00:00:00.000Z}
2020-03-23T15:32:43,644 INFO [main] org.apache.druid.indexing.common.actions.RemoteTaskActionClient - Submitting action for task[index_parallel_160_2020-03-23T15:32:10.645Z] to overlord: [TimeChunkLockTryAcquireAction{, type=EXCLUSIVE, interval=2006-01-13T00:00:00.000Z/2006-01-14T00:00:00.000Z}].
2020-03-23T15:32:43,783 INFO [main] org.apache.druid.indexing.common.actions.RemoteTaskActionClient - Performing action for task[index_parallel_160_2020-03-23T15:32:10.645Z]: TimeChunkLockTryAcquireAction{, type=EXCLUSIVE, interval=2023-10-23T00:00:00.000Z/2023-10-24T00:00:00.000Z}
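For completeness, this is how I try to kill the blocked task, using the standard overlord shutdown endpoint with the task id copied from the log above; the task remains blocked afterwards:

curl -X POST http://localhost:8081/druid/indexer/v1/task/index_parallel_160_2020-03-23T15:32:10.645Z/shutdown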

I suspect a configuration issue on this specific platform: any hints on what to investigate?

Regards

Hi DanC,
Do you see any errors/warnings/exceptions/failures in the task log and the overlord log related to this index_parallel task?
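If you do not have the task log at hand, you can fetch it directly from the overlord, e.g.:

curl http://localhost:8081/druid/indexer/v1/task/index_parallel_160_2020-03-23T15:32:10.645Z/log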

Thanks,

–siva