Baseline processing time for Hadoop Indexer

Hello, we are currently reindexing a day’s worth of events into Druid using the Hadoop indexer. Is there any sort of baseline amount of time you have encountered when using this utility? Obviously mileage varies a lot depending on setup, so here is ours:

We’re reprocessing about 230GB of data using one indexing task

Cluster: EMR running Hadoop 2.6.0, Druid 0.9.0

Master: r3.4xlarge

Core: 40 r3.2xlarges

I also threw in some additional configurations, some of which may be hindering our performance:

'yarn.nodemanager.vmem-check-enabled': 'false',
'yarn.nodemanager.vmem-pmem-ratio': '4',
'mapreduce.task.timeout': '1800000',
'mapreduce.reduce.memory.mb': '16348',
'mapreduce.map.memory.mb': '7296',
'mapreduce.map.java.opts': '-server -Xms2918m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps',
'mapreduce.reduce.java.opts': '-server -Xms8g -Xmx8g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps'
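
For context, in a Druid 0.9.0 Hadoop indexing task these Hadoop properties would normally be passed through the jobProperties block of the spec file’s tuningConfig. A minimal sketch of that wiring (the datasource name and input path below are placeholders, not our real values, and the dataSchema is trimmed for brevity):

    {
      "type": "index_hadoop",
      "spec": {
        "dataSchema": { "dataSource": "events" },
        "ioConfig": {
          "type": "hadoop",
          "inputSpec": { "type": "static", "paths": "s3://example-bucket/events/2016-05-01/" }
        },
        "tuningConfig": {
          "type": "hadoop",
          "jobProperties": {
            "mapreduce.map.memory.mb": "7296",
            "mapreduce.reduce.memory.mb": "16348",
            "mapreduce.task.timeout": "1800000"
          }
        }
      }
    }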

The full job takes between 3 and 4 hours right now. Additionally, about 75 extra GB of data landed one day, and that either roughly doubled the processing time or caused the indexing task to fail on the final reduce step with memory errors. What I’d like to know is whether this setup is hindering our performance in some way, or, if that’s not an easy question to answer, what your own setup and timings looked like when running a Hadoop indexing task. I appreciate any help; I’m stumbling around in the dark right now.
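
One guess I haven’t verified: the final reduce is where the actual segments get built, so if a day’s data hashes into too few shards, each index-generation reducer has to hold a very large segment in memory. Lowering targetPartitionSize in the hashed partitionsSpec should produce more, smaller shards and spread that reduce work out. A sketch of the tuningConfig fragment I mean (3000000 is an arbitrary example value, not something we’ve tested):

    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 3000000
      }
    }

If I understand hashed partitioning correctly, more shards per interval means more index-generation reducers, each building a smaller segment; happy to be corrected if that reasoning is off.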

Hi John,
Could you also share your spec file and, optionally, the batch ingestion logs, for some more detail?