[druid-user] very big segment no partitioning

Hi,
I am using native batch ingestion with day segment granularity and hashed partitioning.

I get only one big segment of 50,000,000 rows even though I set targetRowsPerSegment to 5,000,000. Am I missing something?

This is what my task configuration looks like:

{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "local",
        "filter": "feeddruidzPXVpI_*",
        "baseDir": "/home/tmp/"
      },
      "inputFormat": {
        "type": "csv",
        "listDelimiter": ";",
        "columns": [
          "eventDate", …
        ]
      },
      "appendToExisting": false
    },
    "dataSchema": {
      "dataSource": "DATASOURCE1",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "day",
        "queryGranularity": "day",
        "intervals": ["2021-01-07T00:00:00Z/2021-01-08T00:00:00Z"]
      },
      "dimensionsSpec": {
        "dimensions": […]
      },
      "timestampSpec": {
        "format": "posix",
        "column": "eventDate"
      },
      "metricsSpec": [
        {
          "name": "count",
          "type": "count"
        }, …
      ]
    },
    "tuningConfig": {
      "type": "index_parallel",
      "forceGuaranteedRollup": true,
      "maxNumConcurrentSubTasks": 2,
      "partitionsSpec": {
        "type": "hashed",
        "targetRowsPerSegment": 5000000
      }
    }
  }
}

Segmentation is controlled by your granularitySpec and is always done based on time. Partitioning is secondary. Right now you have it set to 1 segment per day and only a single day defined. If all your data falls into that interval, it will all go into that segment.
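For illustration, a minimal granularitySpec sketch (the three-day interval here is hypothetical):

"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "day",
  "intervals": ["2021-01-05T00:00:00Z/2021-01-08T00:00:00Z"]
}

This would create three day-long time chunks. The partitionsSpec never changes the number of time chunks; it only controls how the rows within each chunk are split into segments.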

Thank you Rachel. Sorry, I suppose I used the wrong terminology.
What I meant is: why do I have only one big partition of 50M rows when targetRowsPerSegment is 5M?
According to the documentation, targetRowsPerSegment is "A target row count for each partition".

Before, I was using Hadoop indexing and I had multiple partitions per day. For example, on January 5:

2021-01-05T00:00:00.000Z_2021-01-06T00:00:00.000Z/2021-01-06T05:40:44.393Z/0/index.zip
2021-01-05T00:00:00.000Z_2021-01-06T00:00:00.000Z/2021-01-06T05:40:44.393Z/1/index.zip
2021-01-05T00:00:00.000Z_2021-01-06T00:00:00.000Z/2021-01-06T05:40:44.393Z/2/index.zip
2021-01-05T00:00:00.000Z_2021-01-06T00:00:00.000Z/2021-01-06T05:40:44.393Z/3/index.zip
2021-01-05T00:00:00.000Z_2021-01-06T00:00:00.000Z/2021-01-06T05:40:44.393Z/4/index.zip
2021-01-05T00:00:00.000Z_2021-01-06T00:00:00.000Z/2021-01-06T05:40:44.393Z/5/index.zip
2021-01-05T00:00:00.000Z_2021-01-06T00:00:00.000Z/2021-01-06T05:40:44.393Z/6/index.zip
2021-01-05T00:00:00.000Z_2021-01-06T00:00:00.000Z/2021-01-06T05:40:44.393Z/7/index.zip
2021-01-05T00:00:00.000Z_2021-01-06T00:00:00.000Z/2021-01-06T05:40:44.393Z/8/index.zip
2021-01-05T00:00:00.000Z_2021-01-06T00:00:00.000Z/2021-01-06T05:40:44.393Z/9/index.zip

Hi,
We are facing the same issue while moving from Hadoop ingestion to native ingestion (index_parallel).
Everything is working well except the partitionsSpec configuration. We are using Druid 0.20.0.
Previously, in the Hadoop partitionsSpec we were using targetPartitionSize, but it was deprecated in favor of targetRowsPerSegment.
In native ingestion there is only targetRowsPerSegment.
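To make the rename concrete, here is a sketch of both forms as we understand them (the row count is just an example). The old Hadoop spec:

"partitionsSpec": {
  "type": "hashed",
  "targetPartitionSize": 7500000
}

and its replacement:

"partitionsSpec": {
  "type": "hashed",
  "targetRowsPerSegment": 7500000
}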

So we updated our ingestion specs, but when we look at the ingestion in the UI the parameter is translated to maxRowsPerSegment, and now our data is no longer partitioned: we get only one big segment.

Here is the partitionsSpec we post to Druid:
"partitionsSpec": {
  "type": "hashed",
  "targetRowsPerSegment": 7500000,
  "partitionDimensions": [
    "columnA"
  ]
},

Here is what it gets translated to in the UI:
"tuningConfig": {
  "type": "index_parallel",
  "maxRowsPerSegment": 7500000,
  "maxRowsInMemory": 1000000,
  "maxBytesInMemory": 0,
  "maxTotalRows": null,
  "numShards": null,
  "splitHintSpec": null,
  "partitionsSpec": {
    "type": "hashed",
    "numShards": null,
    "partitionDimensions": [
      "columnA"
    ],
    "partitionFunction": "murmur3_32_abs",
    "maxRowsPerSegment": 7500000
  },

We have a second ingestion that uses numShards instead of targetRowsPerSegment, and that one works.
Is this a bug? How can we get partition sizing to work with index_parallel?
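In case it helps: a sketch of the numShards workaround implied above (the shard count is hypothetical; it has to be sized from your own daily row counts, e.g. the first post's 50M rows per day at a 5M-row target would suggest 10 shards, matching the ten Hadoop partitions listed there):

"partitionsSpec": {
  "type": "hashed",
  "numShards": 10,
  "partitionDimensions": [
    "columnA"
  ]
}

As far as we can tell, numShards and targetRowsPerSegment cannot both be set on a hashed spec, so targetRowsPerSegment is dropped here.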