[druid-user] Batch ingestion with multiple subtasks is failing on random nodes

We have a 25-node cluster: one master node (which also runs ZooKeeper), one broker, and the rest data nodes. One of our index_parallel jobs runs 10 subtasks, each ingesting 3 files from HDFS. This was working fine, but recently it has started failing on random nodes with the errors below.
Any idea where to look in more detail? When we start the same job with a single subtask, it completes fine.

java.lang.RuntimeException: java.lang.InterruptedException
at org.apache.druid.client.indexing.HttpIndexingServiceClient.getTaskStatus(HttpIndexingServiceClient.java:279) ~[druid-server-0.21.0.jar:0.21.0]
at org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor.lambda$start$0(TaskMonitor.java:128) ~[druid-indexing-service-0.21.0.jar:0.21.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_291]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_291]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_291]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_291]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_291]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_291]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_291]
Caused by: java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998) ~[?:1.8.0_291]
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) ~[?:1.8.0_291]
at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285) ~[guava-16.0.1.jar:?]
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[guava-16.0.1.jar:?]
at org.apache.druid.discovery.DruidLeaderClient.go(DruidLeaderClient.java:143) ~[druid-server-0.21.0.jar:0.21.0]
at org.apache.druid.discovery.DruidLeaderClient.go(DruidLeaderClient.java:127) ~[druid-server-0.21.0.jar:0.21.0]
at org.apache.druid.client.indexing.HttpIndexingServiceClient.getTaskStatus(HttpIndexingServiceClient.java:264) ~[druid-server-0.21.0.jar:0.21.0]
... 8 more
2021-09-20T22:32:11,200 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Request to cancel subtask[single_phase_sub_task_US_SCORECARD_DEV.DAILY_SALES_DTL_pfkageka_2021-09-20T22:31:28.197Z]
2021-09-20T22:32:11,205 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Request to cancel subtask[single_phase_sub_task_US_SCORECARD_DEV.DAILY_SALES_DTL_ohdacibj_2021-09-20T22:30:02.426Z]
2021-09-20T22:32:11,212 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Request to cancel subtask[single_phase_sub_task_US_SCORECARD_DEV.DAILY_SALES_DTL_iiengljh_2021-09-20T22:30:02.501Z]
2021-09-20T22:32:11,224 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Request to cancel subtask[single_phase_sub_task_US_SCORECARD_DEV.DAILY_SALES_DTL_gganjbho_2021-09-20T22:31:29.196Z]
2021-09-20T22:32:11,237 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Stopped taskMonitor
2021-09-20T22:32:11,240 INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_parallel_US_SCORECARD_DEV.DAILY_SALES_DTL_inppkpbh_2021-09-20T22:29:55.597Z",
  "status" : "FAILED",
  "duration" : 129390,
  "errorMsg" : null,
  "location" : {
    "host" : null,
    "port" : -1,
    "tlsPort" : -1
  }
}
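
Since the parent task only reports "errorMsg" : null, the status of each subtask is probably more informative. Here is a minimal sketch for pulling it from the Overlord task status endpoint (/druid/indexer/v1/task/{taskId}/status). The base URL passed in (for example http://OVERLORD_HOST:8090) is a placeholder and an assumption about the deployment, and the helper names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request


def task_status_url(overlord_base, task_id):
    """Build the Overlord status URL for one task (base URL is an assumption)."""
    # Task ids contain ':' characters, so percent-encode the id for the path.
    quoted = urllib.parse.quote(task_id, safe="")
    return f"{overlord_base.rstrip('/')}/druid/indexer/v1/task/{quoted}/status"


def summarize_status(payload):
    """Keep only the fields useful for triage.

    Accepts either a bare status object or one wrapped as
    {"task": ..., "status": {...}}.
    """
    status = payload.get("status")
    if not isinstance(status, dict):
        # In a bare status object, "status" is just a string such as "FAILED".
        status = payload
    location = status.get("location") or {}
    return {
        "id": status.get("id"),
        "state": status.get("statusCode") or status.get("status"),
        "host": location.get("host"),
        "errorMsg": status.get("errorMsg"),
    }


def fetch_status(overlord_base, task_id, timeout=10):
    """Fetch and summarize one task's status (needs network access to the Overlord)."""
    url = task_status_url(overlord_base, task_id)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return summarize_status(json.load(resp))
```

Calling fetch_status for each subtask id from the "Request to cancel subtask[...]" lines above should show which host each subtask ran on and its errorMsg, if any.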

Any suggestions on where to look? We are still seeing the failures.
Below is the config portion of the payload. When "maxNumConcurrentSubTasks" is set to 10, the job fails on random nodes with the above error.

},
"tuningConfig": {
  "type": "index_parallel",
  "maxRowsPerSegment": 5000000,
  "appendableIndexSpec": {
    "type": "onheap"
  },
  "maxRowsInMemory": 300000,
  "maxBytesInMemory": 0,
  "maxTotalRows": null,
  "numShards": null,
  "splitHintSpec": null,
  "partitionsSpec": {
    "type": "dynamic",
    "maxRowsPerSegment": 5000000,
    "maxTotalRows": null
  },
  "indexSpec": {
    "bitmap": {
      "type": "roaring",
      "compressRunOnSerialization": true
    },
    "dimensionCompression": "lz4",
    "metricCompression": "lz4",
    "longEncoding": "longs",
    "segmentLoader": null
  },
  "indexSpecForIntermediatePersists": {
    "bitmap": {
      "type": "roaring",
      "compressRunOnSerialization": true
    },
    "dimensionCompression": "lz4",
    "metricCompression": "lz4",
    "longEncoding": "longs",
    "segmentLoader": null
  },
  "maxPendingPersists": 0,
  "forceGuaranteedRollup": false,
  "reportParseExceptions": false,
  "pushTimeout": 0,
  "segmentWriteOutMediumFactory": null,
  "maxNumConcurrentSubTasks": 10,
  "maxRetry": 3,
  "taskStatusCheckPeriodMs": 1000,
  "chatHandlerTimeout": "PT10S",
  "chatHandlerNumRetries": 5,
  "maxNumSegmentsToMerge": 100,
  "totalNumMergeTasks": 10,
  "logParseExceptions": false,
  "maxParseExceptions": 2147483647,
  "maxSavedParseExceptions": 0,
  "maxColumnsToMerge": -1,
  "buildV9Directly": true,
  "partitionDimensions":

Whenever "maxNumConcurrentSubTasks" is set to 10, the job fails on random nodes. The subtask runs on one host, but the error message names a different host, as shown below (the subtask ran on lxappdrudprds06, but the Java exception we are seeing is for lxappdrudprds14). The job runs fine when "maxNumConcurrentSubTasks" is set to 1.

  "id": "single_phase_sub_task_US_SCORECARD_DEV.PRICE_ADJ_DAILY_METRICS_dbghchje_2021-09-23T12:45:14.571Z",
  "groupId": "index_parallel_US_SCORECARD_DEV.PRICE_ADJ_DAILY_METRICS_ifpmncde_2021-09-23T12:44:05.186Z",
  "type": "single_phase_sub_task",
  "createdTime": "2021-09-23T12:45:14.574Z",
  "queueInsertionTime": "1970-01-01T00:00:00.000Z",
  "statusCode": "FAILED",
  "status": "FAILED",
  "runnerStatusCode": "WAITING",
  "duration": 44674,
  "location": {
    "host": "lxappdrudprds06",
    "port": 8101,
    "tlsPort": -1
  },
  "dataSource": "US_SCORECARD_DEV.PRICE_ADJ_DAILY_METRICS",
  "errorMsg": "java.net.UnknownHostException: lxappdrudprds14"
}
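
The UnknownHostException in the errorMsg may mean that some node cannot resolve another node's hostname, though that is a guess at this point. A rough sketch for checking it on each node follows: it pulls the hostname out of an errorMsg and tries to resolve a list of hosts. The function names and the injectable resolve parameter are made up for illustration; only the two hostnames come from the output above.

```python
import re
import socket

# Matches the hostname that follows "java.net.UnknownHostException:".
_UNKNOWN_HOST = re.compile(r"java\.net\.UnknownHostException:\s*(\S+)")


def unknown_host_from_error(error_msg):
    """Return the hostname named in an UnknownHostException message, or None."""
    if not error_msg:
        return None
    match = _UNKNOWN_HOST.search(error_msg)
    return match.group(1) if match else None


def check_hosts(hosts, resolve=socket.gethostbyname):
    """Map each hostname to its resolved address, or None if resolution fails.

    `resolve` is injectable so the logic can be exercised without real DNS.
    """
    results = {}
    for host in hosts:
        try:
            results[host] = resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[host] = None
    return results
```

For example, running check_hosts(["lxappdrudprds06", "lxappdrudprds14"]) on every node and comparing the results would show whether any node fails to resolve its peers.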