Native index_parallel ingestion failing at scale with "Caused by: Failed to connect to host"

I am exploring the new Druid 0.21.1 version with 100 MiddleManagers and 2 master nodes (coordinator & overlord on the same machine).
I am running multiple single-dimension native index_parallel ingestions from S3 into Druid, covering around 500GB of data spread across 14 datasources.
A couple of the ingestions completed successfully, but some of them are failing with the errors below:

INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "partial_index_generic_merge_scheduling_metrics_blue_gpboefma_2021-10-26T08:47:25.419Z",
  "status" : "FAILED",
  "duration" : 47291,
  "errorMsg" : "java.nio.channels.ClosedByInterruptException",
  "location" : {
    "host" : null,
    "port" : -1,
    "tlsPort" : -1
Caused by: Failed to connect to host[https://ip-***:8291]
	at$2.operationComplete( ~[druid-core-0.21.1.jar:0.21.1]
	at ~[netty-3.10.6.Final.jar:?]
	at ~[netty-3.10.6.Final.jar:?]
	at ~[druid-core-0.21.1.jar:0.21.1]
	at ~[druid-core-0.21.1.jar:0.21.1]
	at$ImmediateCreationResourceHolder.get( ~[druid-core-0.21.1.jar:0.21.1]
	at ~[druid-core-0.21.1.jar:0.21.1]
	at ~[druid-core-0.21.1.jar:0.21.1]
	at ~[druid-core-0.21.1.jar:0.21.1]
	at ~[druid-core-0.21.1.jar:0.21.1]
	at org.apache.druid.indexing.common.task.batch.parallel.HttpShuffleClient.lambda$fetchSegmentFile$0( ~[druid-indexing-service-0.21.1.jar:0.21.1]

Any help is appreciated.


Hm, I wonder if your merge step is failing because of memory or something like that… are there any other hints in the ingestion task logs themselves?

Also, have you tried with many fewer MiddleManager processes (say, 10) and many more cores per node, so you can increase druid.worker.capacity? See: Configuration reference · Apache Druid
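For example, on each MiddleManager you could raise the task slot count and size the peons to match the bigger node. A minimal sketch of middleManager/runtime.properties — the property names (druid.worker.capacity, druid.indexer.runner.javaOptsArray) are from the Druid configuration reference, but the values here are purely illustrative and need tuning to your hardware:

```properties
# middleManager/runtime.properties (illustrative values, not a recommendation)
# Number of concurrent task slots on this MiddleManager;
# with fewer, larger nodes this replaces fan-out across 100 small MMs.
druid.worker.capacity=16

# JVM options for each peon (task) process; give the merge step
# enough heap/direct memory to avoid OOM-style failures.
druid.indexer.runner.javaOptsArray=["-server","-Xms2g","-Xmx2g","-XX:MaxDirectMemorySize=3g"]
```

The trade-off is that shuffle traffic in partial_index_generic_merge fans out to every worker holding partial segments, so fewer, beefier MiddleManagers also means fewer hosts for each merge task to connect to.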