Index task taking too long to ingest a 1.3 MB file with 11211 records. It took 60 minutes (one hour).

Hi All,

I am executing an index task with the details below:

Number of rows in file: 11211

File size: 1.3 MB

Interval given to index task: 7 months

Time taken to execute index task: 60 minutes (one hour)

Executing index task with appendToExisting=false.

Please tell me how to improve the performance of the index task.

Thanks in Advance.

Hey Banesh,

My guess is that it is generating too many segments. With 11211 records spread over 7 months, you should consider setting segmentGranularity to MONTH or even YEAR.
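
For illustration, a minimal sketch of what that change could look like in the granularitySpec (the interval and queryGranularity values here are only placeholders, not recommendations):

"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "YEAR",
  "queryGranularity": "DAY",
  "intervals": ["2018-01-01/2018-08-31"]
}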

For me, the index task segment granularity is DAY.

These are my middle manager settings:

druid.indexer.fork.property.druid.processing.numMergeBuffers=2
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=300000000
druid.indexer.fork.property.druid.processing.numThreads=7
druid.worker.capacity=10

I have one more question here.

What is the difference between "druid.indexer.fork.property.druid.processing.numThreads" and "druid.worker.capacity"?

I can see the number of worker processes that start using the jps command.

How is the number of threads maintained by Druid?

Hi Banesh,

processing.numMergeBuffers, processing.buffer.sizeBytes, and processing.numThreads relate to query resources and are only used when your indexing task also responds to queries, which applies to realtime-type tasks such as Kafka indexing. When ingesting from a file (i.e. batch ingestion), these configurations are not used, since the data isn't available for querying until after the job completes.

druid.worker.capacity controls how many indexing tasks can be spawned by the middle manager at a time. If it’s too low, your hardware will be underutilized, and if it’s too high, you’ll run out of resources on the machine.
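
As a rough sketch only (these numbers are assumptions for an 8-core MiddleManager, not a recommendation for your hardware), the relationship between the two kinds of settings in the MiddleManager runtime.properties might look like this:

# Hypothetical MiddleManager runtime.properties sizing (assumes an 8-core machine)
# Up to 7 indexing tasks (peons) may run at once on this MiddleManager
druid.worker.capacity=7
# Heap given to each spawned peon JVM
druid.indexer.runner.javaOpts=-server -Xmx2g
# Per-peon query processing resources; only relevant for realtime-type tasks
druid.indexer.fork.property.druid.processing.numThreads=2
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=100000000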

I don’t think either of these settings is the cause of your slow ingestion. Did you try changing segmentGranularity to YEAR? If that change hasn’t improved things, post your indexing spec here and we can check whether there are other settings that need to be tweaked.

Hi DavidLim,

Please find my index task specification below.

{
  "type": "index",
  "spec": {
    "dataSchema": {
      "metricsSpec": [
        {
          "name": "edrcount",
          "type": "count"
        }
      ],
      "parser": {
        "parseSpec": {
          "dimensionsSpec": {
            "dimensions": [
              "intcolumn1",
              "intcolumn2",
              "intcolumn3"
            ]
          },
          "columns": [
            "stringcolumn1",
            "stringcolumn2",
            "datecolumn1",
            "intcolumn1",
            "intcolumn2",
            "intcolumn3",
            "intcolumn4"
          ],
          "format": "csv",
          "timestampSpec": {
            "format": "yyyy-MM-dd HH:mm:ss",
            "column": "datecolumn1"
          }
        },
        "type": "string"
      },
      "granularitySpec": {
        "intervals": [
          "2018-01-01/2018-08-31"
        ],
        "segmentGranularity": "day",
        "queryGranularity": "day",
        "type": "uniform"
      },
      "dataSource": "test_summary_12"
    },
    "ioConfig": {
      "firehose": {
        "baseDir": "/apps/datafiles_1/testfiles/",
        "filter": "testfile.txt",
        "type": "local"
      },
      "appendToExisting": "true",
      "type": "index"
    },
    "tuningConfig": {
      "targetPartitionSize": 5000000,
      "type": "index",
      "maxRowsInMemory": 39999
    }
  }
}