Index_parallel only creates one subtask

I'm trying to optimize Druid ingestion performance by using subtasks.
The datasource is from S3: there are 17 .gz files in one bucket, each 34.4 MB in size.
As this is just for testing, I copied one .gz file 16 times to make the ingested data larger,
so the 17 files are identical except for their file names.

Druid components are running on Docker.

My question is: why is there only 1 subtask?
single_phase_sub_task_stresstest_bjenbgek_2022-08-11T08:04:13.017Z

TaskId: single_phase_sub_task_stresstest_bjenbgek_2022-08-11T08:04:13.017Z GroupId:555 Type:single_phase_sub_task

TaskId: 555 GroupId:555 Type:index_parallel

My understanding is that Druid will spawn as many subtasks as maxNumConcurrentSubTasks allows. I have set this to 16 (4 MiddleManagers * 4 slots each), but still only 1 subtask is running.

My task payload is pasted below:

{
  "type": "index_parallel",
  "id": "555",
  "groupId": "555",
  "resource": {
    "availabilityGroup": "555",
    "requiredCapacity": 1
  },
  "spec": {
    "dataSchema": {
      "dataSource": "stresstest",
      "timestampSpec": {
        "column": "MISSING_COLUMN",
        "format": "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'",
        "missingValue": "1970-01-01T00:00:00.888Z"
      },
      "dimensionsSpec": {
        "dimensions": [
          {
            "type": "string",
            "name": "content_identifier",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": true
          },
          {
            "type": "string",
            "name": "information_package_iri",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": true
          },
          {
            "type": "string",
            "name": "iri",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": true
          },
          {
            "type": "long",
            "name": "x1_value_int",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": false
          },
          {
            "type": "long",
            "name": "y1_value_int",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": false
          }
        ],
        "dimensionExclusions": [
          "__time",
          "MISSING_COLUMN"
        ]
      },
      "metricsSpec": [],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": {
          "type": "none"
        },
        "queryGranularity": {
          "type": "none"
        },
        "rollup": false,
        "intervals": [
          "1970-01-01T00:00:00.888Z/1970-01-01T00:00:00.889Z"
        ]
      },
      "transformSpec": {
        "filter": null,
        "transforms": []
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "s3",
        "uris": null,
        "prefixes": [
          "s3://bucketname/foldername/"
        ],
        "objects": null,
        "properties": null
      },
      "inputFormat": {
        "type": "json",
        "flattenSpec": {
          "useFieldDiscovery": true,
          "fields": []
        },
        "featureSpec": {}
      },
      "appendToExisting": false,
      "dropExisting": false
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxRowsPerSegment": 5000000,
      "appendableIndexSpec": {
        "type": "onheap"
      },
      "maxRowsInMemory": 1000000,
      "maxBytesInMemory": 0,
      "skipBytesInMemoryOverheadCheck": false,
      "maxTotalRows": null,
      "numShards": null,
      "splitHintSpec": {
        "type": "maxSize",
        "maxSplitSize": 1073741824,
        "maxNumFiles": 1000
      },
      "partitionsSpec": {
        "type": "dynamic",
        "maxRowsPerSegment": 5000000,
        "maxTotalRows": null
      },
      "indexSpec": {
        "bitmap": {
          "type": "roaring",
          "compressRunOnSerialization": true
        },
        "dimensionCompression": "lz4",
        "metricCompression": "lz4",
        "longEncoding": "longs",
        "segmentLoader": null
      },
      "indexSpecForIntermediatePersists": {
        "bitmap": {
          "type": "roaring",
          "compressRunOnSerialization": true
        },
        "dimensionCompression": "lz4",
        "metricCompression": "lz4",
        "longEncoding": "longs",
        "segmentLoader": null
      },
      "maxPendingPersists": 0,
      "forceGuaranteedRollup": false,
      "reportParseExceptions": false,
      "pushTimeout": 0,
      "segmentWriteOutMediumFactory": null,
      "maxNumConcurrentSubTasks": 16,
      "maxRetry": 3,
      "taskStatusCheckPeriodMs": 1000,
      "chatHandlerTimeout": "PT10S",
      "chatHandlerNumRetries": 5,
      "maxNumSegmentsToMerge": 100,
      "totalNumMergeTasks": 10,
      "logParseExceptions": false,
      "maxParseExceptions": 2147483647,
      "maxSavedParseExceptions": 0,
      "maxColumnsToMerge": -1,
      "awaitSegmentAvailabilityTimeoutMillis": 0,
      "partitionDimensions": []
    }
  },
  "context": {
    "forceTimeChunkLock": true,
    "useLineageBasedSegmentAllocation": true
  },
  "dataSource": "stresstest"
}

Hi Shi-chen, and welcome to the Druid forum.

This is happening because your files are small and a single task can read all of them. This is actually good, because it writes only one segment instead of one segment per file.

If you really want multiple tasks, you can change the following value in the splitHintSpec from its default of 1 GB to a smaller number:

"maxSplitSize": 1073741824

This tells Druid the maximum amount of data (from input files) that a single task can ingest.
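For example, with your 34.4 MB files, a cap of 40 MiB (the exact value here is just my assumption for illustration) means no two files fit into one split, so each file becomes its own split and up to 16 subtasks can run at once given your maxNumConcurrentSubTasks of 16:

    "splitHintSpec": {
      "type": "maxSize",
      "maxSplitSize": 41943040,
      "maxNumFiles": 1000
    }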

Thanks for your answer; now I can run multiple subtasks.
But what do you mean by "it writes only one segment instead of one segment per file"? Shouldn't multiple subtasks give better performance?

Multiple subtasks will absolutely speed up ingestion. But with dynamic partitioning they tend to create too many segments, and those segments can end up too small. We want segments to be about 300-700 MB with roughly 5 million rows each; segments that are too small lead to inefficiencies at query time.

For example, 1 segment of the right size will be more efficient than 17 smaller segments.

Our recommendation is to use 'range' partitioning if you are on a later version of Druid, or 'hashed' if you are using an older version that does not support 'range'.
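As an illustration (the dimension and row target below are just my picks, not something your spec requires), a range partitionsSpec in the tuningConfig would look roughly like this; if I remember correctly, range and hashed partitioning also need "forceGuaranteedRollup": true and "appendToExisting": false:

    "forceGuaranteedRollup": true,
    "partitionsSpec": {
      "type": "range",
      "partitionDimensions": ["content_identifier"],
      "targetRowsPerSegment": 5000000
    }

On older versions you would use "type": "hashed" with a targetRowsPerSegment instead.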

Hey @shi-zhen :). I found this doc quite useful for understanding segment sizes…