Hadoop Indexer performance tuning

Hi there,

I’m running a backfill with the Hadoop indexer for six months’ worth of data,
sized to the tune of 30-40 TB. I started with the granularity spec below, and
each job takes around 20-30 minutes. Although I have capped the input at a
maximum of 120-140 files per hour, the map side of the determine-partitions
job still launches 700-800 tasks with only 1-2 reducers. From reading the
determine-partitions job code, this seems to be due to sparse event times.
My data is time-bucketed upstream, but not 100% reliably, so I assumed
"assumeGrouped" is not an option for me.

granularitySpec": { “type”: “uniform”, “segmentGranularity”: “HOUR”,
“queryGranularity”: “NONE”, “intervals” : [
“2016-06-30T08:00:00.000Z/2016-06-30T08:00:05.000Z” ]
}
},
“ioConfig”: { “type”: “hadoop”, “inputSpec”: { “type”: “granularity”,
“dataGranularity”: “HOUR”, “inputPath”:
“s3n://druid-dev-test/json/data”, “filePattern”: “.*”
}
},
“tuningConfig”: { “type”: “hadoop”, “partitionsSpec”: {
“targetPartitionSize”: 5000000
}
}
}
To get more reducer partitions, I then changed the targetPartitionSize as
below, and the job now creates around 160-170 reduce tasks. However, there is
not much improvement in overall job time.

"ioConfig": {
  "type": "hadoop",
  "inputSpec": {
    "type": "granularity",
    "dataGranularity": "HOUR",
    "inputPath": "s3n://druid-dev-test/data",
    "filePattern": ".*"
  }
},
"tuningConfig": {
  "type": "hadoop",
  "partitionsSpec": {
    "type": "hashed",
    "targetPartitionSize": 50000,
    "maxPartitionSize": 75000
  },
  "useCombiner": true
}
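
Since the determine-partitions pass seems to dominate each job, one option I
am considering (based on my reading of the docs, not something I have tested)
is to set numShards directly in the hashed partitionsSpec, which as I
understand it skips the determine-partitions job entirely. The shard count
below is only an illustrative guess from my rows-per-hour estimate, not a
tuned value:

"tuningConfig": {
  "type": "hadoop",
  "partitionsSpec": {
    "type": "hashed",
    "numShards": 10
  },
  "useCombiner": true
}

Would fixing numShards like this be a reasonable way to cut the per-job
overhead, given that my rows per hour are fairly stable?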

I am running this on a 20-node Hadoop cluster of c3.2xlarge instances. I
would like to hear expert opinions on any improvements I can make here; at
this rate it is going to take months for me to backfill all the data.
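
One thing I have not tried yet is passing Hadoop-side settings through
jobProperties in the tuningConfig. Would tuning mapper/reducer memory there
help? Below is a sketch of what I had in mind; the values are rough guesses
for c3.2xlarge nodes (8 vCPU / 15 GB each), not measured settings:

"tuningConfig": {
  "type": "hadoop",
  "jobProperties": {
    "mapreduce.map.memory.mb": "2048",
    "mapreduce.map.java.opts": "-Xmx1536m",
    "mapreduce.reduce.memory.mb": "4096",
    "mapreduce.reduce.java.opts": "-Xmx3072m"
  }
}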

Thank you.

Also, is there any performance issue if we have many shards in a segment? So far I don’t see much of a difference.