Druid Hadoop batch ingestion runs for a long time and consumes a lot of resources

Hi team,

We are trying to ingest 12 GB of Hadoop data into Druid; the data has 44 dimensions and 50 metrics.
Below is the ingestion spec we are using:

{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "metricsSpec": [],
      "granularitySpec": {
        "queryGranularity": "none",
        "segmentGranularity": "month",
        "type": "uniform",
        "intervals": [
          "2020-08-01T00:00:00.000/2020-09-01T00:00:00.000"
        ]
      },
      "parser": {
        "parseSpec": {
          "timestampSpec": {
            "column": "transaction_created_date",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": []
          },
          "format": "timeAndDims"
        },
        "type": "parquet"
      },
      "dataSource": "test2"
    },
    "tuningConfig": {
      "rowFlushBoundary": "200000",
      "forceExtendableShardSpecs": true,
      "useCombiner": "true",
      "jobProperties": {
        "mapreduce.task.timeout": 6000000,
        "mapreduce.map.memory.mb": "12288",
        "io.compression.codecs": "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec",
        "mapreduce.reduce.memory.mb": "16384",
        "druid.indexer.runner.javaOpts": "-server -Xmx16g -Duser.timezone=UTC -Dfile.encoding=UTF-8",
        "mapreduce.job.split.metainfo.maxsize": "-1",
        "mapreduce.job.queuename": "risk_dna",
        "mapreduce.reduce.java.opts": "-server -Xmx16g -Duser.timezone=UTC -Dfile.encoding=UTF-8",
        "mapreduce.job.priority": "VERY_HIGH",
        "mapreduce.map.java.opts": "-server -Xmx12g -Duser.timezone=UTC -Dfile.encoding=UTF-8",
        "mapreduce.job.classloader": "true"
      },
      "type": "hadoop",
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 5000000
      }
    },
    "ioConfig": {
      "inputSpec": {
        "paths": "hdfs://ba_views.db/lassi_tpvjoin_output/txn_mth_id=2020-08/*.gz.parquet",
        "type": "static",
        "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat"
      },
      "type": "hadoop",
      "appendToExisting": false
    }
  }
}

The source data is 12 GB in total, partitioned into 100 files of roughly 120 MB each, in gzip-compressed Parquet format.

The ingestion job runs for around 4 hours in total: the determine_partitions_hashed MR job takes about 40 minutes, and the index-generator job takes 3+ hours.
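
Since the determine_partitions_hashed pass alone takes 40 minutes: as far as we understand, with hashed partitioning that pass can be skipped entirely by specifying numShards instead of targetPartitionSize. A minimal sketch of the partitionsSpec we were considering (the numShards value here is only an illustrative guess, not something we have sized properly):

  "partitionsSpec": {
    "type": "hashed",
    "numShards": 20
  }

We haven't tried this yet; our understanding is that if numShards is off, the segments would simply come out larger or smaller than the 5,000,000-row target we use today.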

We don't see any errors; the only thing we noticed is a high number of map-output spills:

2020-09-22 06:31:06,368 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
2020-09-22 06:31:06,368 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 300757497; bufend = 185419372; bufvoid = 1073741824
2020-09-22 06:31:06,368 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 75189368(300757472); kvend = 73198008(292792032); length = 1991361/67108864
2020-09-22 06:31:06,368 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) 186314140 kvi 46578528(186314112)
2020-09-22 06:31:09,539 INFO [SpillThread] org.apache.hadoop.mapred.MapTask: Finished spill 63
2020-09-22 06:31:09,540 INFO [main] org.apache.hadoop.mapred.MapTask: (RESET) equator 186314140 kv 46578528(186314112) kvi 46460656(185842624)
2020-09-22 06:32:00,116 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
2020-09-22 06:32:00,116 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 186314140; bufend = 70967113; bufvoid = 1073741824
2020-09-22 06:32:00,116 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 46578528(186314112); kvend = 44585300(178341200); length = 1993229/67108864
2020-09-22 06:32:00,116 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) 71861881 kvi 17965464(71861856)
2020-09-22 06:32:03,744 INFO [SpillThread] org.apache.hadoop.mapred.MapTask: Finished spill 64
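
For the spills, this is the kind of jobProperties change we have been considering; it is only a sketch with illustrative values (these are the standard MR2 sort-buffer properties, and per the log bufvoid is already about 1 GiB, so the buffer may already be close to its practical limit):

  "jobProperties": {
    "mapreduce.task.io.sort.mb": "1536",
    "mapreduce.map.sort.spill.percent": "0.90",
    "mapreduce.task.io.sort.factor": "100"
  }

As far as we know these only affect the MapReduce shuffle side; the Druid persists in the reducer are governed by rowFlushBoundary (maxRowsInMemory in newer versions), which we already set to 200000.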

It would be great if anyone could suggest how to improve the runtime and help us optimize the job. Also, for such data in a queue of about 15 TB, what runtime should we expect?

HDFS source path size: 10.2 G
Druid deep storage size: 181.6 G

Hi! Is this useful to you? It's by someone who knows Hadoop way better than me!

https://imply.io/post/hadoop-indexing-apache-druid-configuration-best-practices