Batch ingestion fails when submitting 2 ingestion tasks concurrently

The Druid cluster has 2 master nodes, 2 query nodes, and 3 data nodes.

When I submit 2 ingestion tasks concurrently, an ingestion task sometimes (not always) fails.

The ingestion spec looks like:

```
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      ...
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "s3",
        "uris": null,
        "prefixes": [
          "s3://XXX/XXX"
        ],
        "objects": null,
        "properties": null
      },
      "inputFormat": {
        "type": "json",
        ...
      },
      "appendToExisting": true,
      "dropExisting": false
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxRowsPerSegment": 5000000,
      "appendableIndexSpec": {
        "type": "onheap"
      },
      "maxRowsInMemory": 10000,
      "maxBytesInMemory": 0,
      "skipBytesInMemoryOverheadCheck": false,
      "maxTotalRows": null,
      "numShards": null,
      "splitHintSpec": null,
      "partitionsSpec": {
        "type": "dynamic",
        "maxRowsPerSegment": 5000000,
        "maxTotalRows": null
      },
      "indexSpec": {
        "bitmap": {
          "type": "roaring",
          "compressRunOnSerialization": true
        },
        "dimensionCompression": "lz4",
        "metricCompression": "lz4",
        "longEncoding": "longs",
        "segmentLoader": null
      },
      "indexSpecForIntermediatePersists": {
        "bitmap": {
          "type": "roaring",
          "compressRunOnSerialization": true
        },
        "dimensionCompression": "lz4",
        "metricCompression": "lz4",
        "longEncoding": "longs",
        "segmentLoader": null
      },
      "maxPendingPersists": 0,
      "forceGuaranteedRollup": false,
      "reportParseExceptions": false,
      "pushTimeout": 0,
      "segmentWriteOutMediumFactory": null,
      "maxNumConcurrentSubTasks": 4,
      "maxRetry": 3,
      "taskStatusCheckPeriodMs": 1000,
      "chatHandlerTimeout": "PT10S",
      "chatHandlerNumRetries": 5,
      "maxNumSegmentsToMerge": 100,
      "totalNumMergeTasks": 10,
      "logParseExceptions": true,
      "maxParseExceptions": 10,
      "maxSavedParseExceptions": 100,
      "maxColumnsToMerge": -1,
      "awaitSegmentAvailabilityTimeoutMillis": 0,
      "partitionDimensions": ...
    }
  }
}
```

Logs:

```
2022-11-04T04:51:41,576 INFO [main] org.eclipse.jetty.server.Server - Started @6895ms
2022-11-04T04:51:41,576 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Starting lifecycle [module] stage [ANNOUNCEMENTS]
2022-11-04T04:51:41,576 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Successfully started lifecycle [module]
2022-11-04T04:54:06,742 WARN [task-runner-0-priority-0] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://druid-data-prod-us-3-nlb-2fc6a72e9d0c793f.elb.us-east-1.amazonaws.com:8102/druid/worker/v1/chat/index_parallel_master_order_event_ckdmgbak_2022-11-04T04%3A23%3A26.784Z/report]; will try again in [PT2S] (body/exception: [Connection timed out (Connection timed out)])
2022-11-04T04:56:19,862 WARN [task-runner-0-priority-0] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://druid-data-prod-us-3-nlb-2fc6a72e9d0c793f.elb.us-east-1.amazonaws.com:8102/druid/worker/v1/chat/index_parallel_master_order_event_ckdmgbak_2022-11-04T04%3A23%3A26.784Z/report]; will try again in [PT4S] (body/exception: [Connection timed out (Connection timed out)])
2022-11-04T04:58:35,030 WARN [task-runner-0-priority-0] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://druid-data-prod-us-3-nlb-2fc6a72e9d0c793f.elb.us-east-1.amazonaws.com:8102/druid/worker/v1/chat/index_parallel_master_order_event_ckdmgbak_2022-11-04T04%3A23%3A26.784Z/report]; will try again in [PT8S] (body/exception: [Connection timed out (Connection timed out)])
2022-11-04T05:00:54,294 WARN [task-runner-0-priority-0] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://druid-data-prod-us-3-nlb-2fc6a72e9d0c793f.elb.us-east-1.amazonaws.com:8102/druid/worker/v1/chat/index_parallel_master_order_event_ckdmgbak_2022-11-04T04%3A23%3A26.784Z/report]; will try again in [PT10S] (body/exception: [Connection timed out (Connection timed out)])
2022-11-04T05:03:13,558 WARN [task-runner-0-priority-0] org.apache.druid.indexing.common.IndexTaskClient - Bad response HTTP [no response] from [http://druid-data-prod-us-3-nlb-2fc6a72e9d0c793f.elb.us-east-1.amazonaws.com:8102/druid/worker/v1/chat/index_parallel_master_order_event_ckdmgbak_2022-11-04T04%3A23%3A26.784Z/report]; will try again in [PT10S] (body/exception: [Connection timed out (Connection timed out)])
2022-11-04T05:05:32,823 WARN [task-runner-0-priority-0] org.apache.druid.indexing.common.IndexTaskClient - Retries exhausted for [http://druid-data-prod-us-3-nlb-2fc6a72e9d0c793f.elb.us-east-1.amazonaws.com:8102/druid/worker/v1/chat/index_parallel_master_order_event_ckdmgbak_2022-11-04T04%3A23%3A26.784Z/report], last exception:
java.net.ConnectException: Connection timed out (Connection timed out)
	at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_275]
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_275]
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_275]
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_275]
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_275]
	at java.net.Socket.connect(Socket.java:607) ~[?:1.8.0_275]
	at java.net.Socket.connect(Socket.java:556) ~[?:1.8.0_275]
	at java.net.Socket.<init>(Socket.java:452) ~[?:1.8.0_275]
	at java.net.Socket.<init>(Socket.java:229) ~[?:1.8.0_275]
	at org.apache.druid.indexing.common.IndexTaskClient.checkConnection(IndexTaskClient.java:209) ~[druid-indexing-service-0.22.1.jar:0.22.1]
	at org.apache.druid.indexing.common.IndexTaskClient.submitRequest(IndexTaskClient.java:348) ~[druid-indexing-service-0.22.1.jar:0.22.1]
	at org.apache.druid.indexing.common.IndexTaskClient.submitSmileRequest(IndexTaskClient.java:258) ~[druid-indexing-service-0.22.1.jar:0.22.1]
	at org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexSupervisorTaskClient.report(ParallelIndexSupervisorTaskClient.java:120) ~[druid-indexing-service-0.22.1.jar:0.22.1]
	at org.apache.druid.indexing.common.task.batch.parallel.PartialDimensionDistributionTask.sendReport(PartialDimensionDistributionTask.java:301) ~[druid-indexing-service-0.22.1.jar:0.22.1]
	at org.apache.druid.indexing.common.task.batch.parallel.PartialDimensionDistributionTask.runTask(PartialDimensionDistributionTask.java:239) ~[druid-indexing-service-0.22.1.jar:0.22.1]
	at org.apache.druid.indexing.common.task.AbstractBatchIndexTask.run(AbstractBatchIndexTask.java:159) ~[druid-indexing-service-0.22.1.jar:0.22.1]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:471) ~[druid-indexing-service-0.22.1.jar:0.22.1]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:443) ~[druid-indexing-service-0.22.1.jar:0.22.1]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_275]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_275]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_275]
	at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_275]
2022-11-04T05:05:32,836 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Exception while running task[AbstractTask{id='partial_dimension_distribution_master_order_event_bgldfega_2022-11-04T04:51:34.628Z', groupId='index_parallel_master_order_event_ckdmgbak_2022-11-04T04:23:26.784Z', taskResource=TaskResource{availabilityGroup='partial_dimension_distribution_master_order_event_bgldfega_2022-11-04T04:51:34.628Z', requiredCapacity=1}, dataSource='master_order_event', context={forceTimeChunkLock=true, useLineageBasedSegmentAllocation=true}}]
java.lang.RuntimeException: java.net.ConnectException: Connection timed out (Connection timed out)
	at org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexSupervisorTaskClient.report(ParallelIndexSupervisorTaskClient.java:137) ~[druid-indexing-service-0.22.1.jar:0.22.1]
	at org.apache.druid.indexing.common.task.batch.parallel.PartialDimensionDistributionTask.sendReport(PartialDimensionDistributionTask.java:301) ~[druid-indexing-service-0.22.1.jar:0.22.1]
	at org.apache.druid.indexing.common.task.batch.parallel.PartialDimensionDistributionTask.runTask(PartialDimensionDistributionTask.java:239) ~[druid-indexing-service-0.22.1.jar:0.22.1]
	at org.apache.druid.indexing.common.task.AbstractBatchIndexTask.run(AbstractBatchIndexTask.java:159) ~[druid-indexing-service-0.22.1.jar:0.22.1]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:471) ~[druid-indexing-service-0.22.1.jar:0.22.1]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:443) ~[druid-indexing-service-0.22.1.jar:0.22.1]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_275]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_275]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_275]
	at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_275]
Caused by: java.net.ConnectException: Connection timed out (Connection timed out)
	at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_275]
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_275]
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_275]
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_275]
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_275]
	at java.net.Socket.connect(Socket.java:607) ~[?:1.8.0_275]
	at java.net.Socket.connect(Socket.java:556) ~[?:1.8.0_275]
	at java.net.Socket.<init>(Socket.java:452) ~[?:1.8.0_275]
	at java.net.Socket.<init>(Socket.java:229) ~[?:1.8.0_275]
	at org.apache.druid.indexing.common.IndexTaskClient.checkConnection(IndexTaskClient.java:209) ~[druid-indexing-service-0.22.1.jar:0.22.1]
	at org.apache.druid.indexing.common.IndexTaskClient.submitRequest(IndexTaskClient.java:348) ~[druid-indexing-service-0.22.1.jar:0.22.1]
	at org.apache.druid.indexing.common.IndexTaskClient.submitSmileRequest(IndexTaskClient.java:258) ~[druid-indexing-service-0.22.1.jar:0.22.1]
	at org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexSupervisorTaskClient.report(ParallelIndexSupervisorTaskClient.java:120) ~[druid-indexing-service-0.22.1.jar:0.22.1]
	... 9 more
2022-11-04T05:05:32,840 INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "partial_dimension_distribution_master_order_event_bgldfega_2022-11-04T04:51:34.628Z",
  "status" : "FAILED",
  "duration" : 832132,
  "errorMsg" : "java.lang.RuntimeException: java.net.ConnectException: Connection timed out (Connection timed out)",
  "location" : {
    "host" : null,
    "port" : -1,
    "tlsPort" : -1
  }
}
```

`"chatHandlerTimeout": "PT10S"` is 10 seconds, but each actual retry takes about 2 minutes, so I do not think increasing `chatHandlerTimeout` will help.
Can someone help me figure out how to solve this failure when running concurrent ingestion tasks?

Relates to Apache Druid 0.22.1

Hi @yijun. Are these 2 ingestion tasks writing to different datasources?

Hi @Mark_Herrera, no, they are writing to the same datasource.
This happens in the prod environment; in the dev environment there is no such issue.
The 2 differences between the prod and dev environments are:

1. Data node count: prod has 3 data nodes, dev has 2 data nodes.
2. druid.host value: prod uses a DNS name, something like druid-data-prod-us-3-nlb-2fc6a72e9d0c793g.elb.us-east-1.amazonaws.com, while dev uses an IP directly: 10.1.31.98.

Another question: how can I run ingestion tasks one by one instead of concurrently? (Queuing the ingestion submits is my temporary workaround for now; a sketch of it is below.)
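For reference, a minimal sketch of the queueing workaround, using the standard Overlord task API; the Router URL and spec file names here are placeholders, not values from the cluster above:

```python
import json
import time

import requests

ROUTER_URL = "http://druid-router:8888"  # placeholder: Router (or Overlord) address


def submit_and_wait(spec_path, poll_seconds=30):
    """Submit one ingestion spec, then block until the task reaches a terminal state."""
    with open(spec_path) as f:
        spec = json.load(f)
    # POST /druid/indexer/v1/task returns {"task": "<taskId>"}
    resp = requests.post(f"{ROUTER_URL}/druid/indexer/v1/task", json=spec)
    resp.raise_for_status()
    task_id = resp.json()["task"]
    while True:
        status = requests.get(
            f"{ROUTER_URL}/druid/indexer/v1/task/{task_id}/status"
        ).json()["status"]["status"]
        if status in ("SUCCESS", "FAILED"):
            return status
        time.sleep(poll_seconds)


# Submit the specs one after the other instead of concurrently.
for spec_file in ("spec_a.json", "spec_b.json"):
    print(spec_file, "->", submit_and_wait(spec_file))
```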

I think that the general rule of thumb is to avoid this, since you’re likely to get errors due to locks. However:

One thing that comes to mind with your production environment is possible CPU and memory contention. I'm not sure that the druid.host property would matter:

The host for the current process. This is used to advertise the current processes location as reachable from another process and should generally be specified such that http://${druid.host}/ could actually talk to this process
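In each node's runtime.properties that looks like the following; the two values are just the ones from your environments, shown for illustration:

```
# dev data node: IP address used directly
druid.host=10.1.31.98

# prod data node: NLB DNS name (must be reachable from the other processes)
# druid.host=druid-data-prod-us-3-nlb-2fc6a72e9d0c793g.elb.us-east-1.amazonaws.com
```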

I think this is worth a look:

@yijun
What error are they reporting when they fail? Can you share the stack trace?

The problem is solved. It was because our devops had only opened port 8100 on the MiddleManager. We changed it to open ports 8100-8109, and it works now. Thanks, everyone.
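For anyone hitting the same issue: each task (peon) the MiddleManager launches listens on its own HTTP port, allocated from a range starting at druid.indexer.runner.startPort, so the firewall/security group must allow the whole range, not just the first port. A sketch of the relevant MiddleManager runtime.properties (the defaults shown are from the Druid docs; verify against your version):

```
# Each running peon gets its own port from [startPort, endPort], so roughly
# druid.worker.capacity consecutive ports must be reachable from other nodes.
druid.worker.capacity=10
druid.indexer.runner.startPort=8100
druid.indexer.runner.endPort=65535
```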
