Hadoop Indexer MapReduce Efficiency

I’ve been using the Hadoop indexer for Druid in a lambda architecture to replay an entire day’s worth of events into our data source at the end of the day. This has been working pretty well for me, but I’ve noticed that Hadoop seems to shift resources away from map tasks and toward reduce tasks in the last MR step of the indexing job, which slows down completion. For context, the Hadoop cluster has 40 machines with 60GB of memory each. Here is the step I’m referring to:

2016-07-14T13:41:07,944 INFO [main] org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 11273
2016-07-14T13:41:08,648 INFO [main] org.apache.hadoop.mapreduce.JobSubmitter - number of splits:11273
2016-07-14T13:41:08,691 INFO [main] org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1468492040861_0002
2016-07-14T13:41:08,692 INFO [main] org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
2016-07-14T13:41:08,873 INFO [main] org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1468492040861_0002
2016-07-14T13:41:08,876 INFO [main] org.apache.hadoop.mapreduce.Job - The url to track the job: http://ip-0.0.0.8.ec2.internal:20888/proxy/application_1468492040861_0002/
2016-07-14T13:41:08,876 INFO [main] io.druid.indexer.IndexGeneratorJob - Job sparrow-log-hour-index-generator-Optional.of([2016-07-12T00:00:00.000Z/2016-07-14T00:00:00.000Z]) submitted, status available at http://ip-0.0.0.8.ec2.internal:20888/proxy/application_1468492040861_0002/
2016-07-14T13:41:08,876 INFO [main] org.apache.hadoop.mapreduce.Job - Running job: job_1468492040861_0002

As you can see, there are quite a few map tasks to run. These actually run quite quickly, taking about 30 seconds each. The reduce tasks, on the other hand, seem to get to about 10% complete and then stall, no matter when they were launched, until all of the maps are done. A few screenshots:

[<img src="https://lh3.googleusercontent.com/-n-e5VKikt2M/V4eowTdnhlI/AAAAAAAAATE/_t7cUCX-gKgsLlPQgoh58Yfls4J_U_pJwCLcB/s1600/Screen%2BShot%2B2016-07-14%2Bat%2B10.51.22%2BAM.png" border="0">](https://lh3.googleusercontent.com/-n-e5VKikt2M/V4eowTdnhlI/AAAAAAAAATE/_t7cUCX-gKgsLlPQgoh58Yfls4J_U_pJwCLcB/s1600/Screen%2BShot%2B2016-07-14%2Bat%2B10.51.22%2BAM.png)


When the job originally started, it was running about 150 map tasks and about 10 reduce tasks. Now, that’s flipped on its head and the job is crawling along. Here’s what the reduce progress looks like:

[<img src="https://lh3.googleusercontent.com/-kJc1mnHx4IA/V4eopST0eFI/AAAAAAAAATA/A8eWY_QhTBgG7V2nYiO-TQqhxPqenWhagCLcB/s1600/Screen%2BShot%2B2016-07-14%2Bat%2B10.57.08%2BAM.png" border="0" style="">](https://lh3.googleusercontent.com/-kJc1mnHx4IA/V4eopST0eFI/AAAAAAAAATA/A8eWY_QhTBgG7V2nYiO-TQqhxPqenWhagCLcB/s1600/Screen%2BShot%2B2016-07-14%2Bat%2B10.57.08%2BAM.png)

All three pages of reduce tasks look like this, whether they were launched early on or recently.

So ultimately, it seems to me that the reduce tasks need all of the maps to finish before they can get past initialization. Is there some kind of config to tune this? Or is this just how the job works and should I leave it be? I’d appreciate any help on this.

Hey John,

This is just how Hadoop behaves: reducers that start early sit in the shuffle phase, copying map output as it becomes available, and can’t make real progress until all the maps finish. You can delay the reducer start (leaving more resources for mappers) by setting mapreduce.job.reduce.slowstart.completedmaps to something like 0.9 or 1.0.
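If you’re submitting this through the Druid Hadoop index task, one way to pass that property through is the tuningConfig’s jobProperties map. A minimal sketch (the rest of your spec unchanged, and 0.9 is just an example value):

```json
{
  "tuningConfig": {
    "type": "hadoop",
    "jobProperties": {
      "mapreduce.job.reduce.slowstart.completedmaps": "0.9"
    }
  }
}
```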

Hey Gian, thanks for the help on this. That setting has helped, but separately I noticed that the number of running maps is stuck between 15 and 20, despite having 40 boxes running this task. Is that because there is some startup cost to a map task, and they haven’t shown up in the tracker yet? Or is there a way to increase the number of maps? I tried updating mapreduce.tasktracker.map.tasks.maximum, but the internet stresses that it is just a guideline for Hadoop and won’t necessarily increase the number of concurrent maps. I don’t think I can change the size of the files I’m ingesting - is there another way to force more parallel maps? The maps for the Hadoop index task use fairly little CPU and memory, so I’d really like to throw as much horsepower at them as I can. Thanks again for your help on this!

Hey John,

Do you mean the total number of mappers, or the number of running mappers? Are any reducers running? If you set that slowstart setting to 1.0, there should be no reducers running until all the mappers finish, so you should be able to run as many as you have capacity for (in terms of vcores and memory). Perhaps check that your vcore and memory limits in YARN are set appropriately.
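As a rough sketch of the arithmetic (illustrative numbers, not your actual EMR defaults): each node can run roughly min(node memory / per-map memory, node vcores / per-map vcores) map containers at once, so if a NodeManager advertises ~53 GB through yarn.nodemanager.resource.memory-mb and each map requests 2 GB, that’s roughly 26 concurrent mappers per node, or around 1000 across 40 nodes. The per-task requests can also be passed through jobProperties, something like:

```json
{
  "tuningConfig": {
    "type": "hadoop",
    "jobProperties": {
      "mapreduce.map.memory.mb": "2048",
      "mapreduce.map.java.opts": "-Xmx1536m",
      "mapreduce.reduce.memory.mb": "6144",
      "mapreduce.reduce.java.opts": "-Xmx4608m"
    }
  }
}
```

The node-level limits (yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores) are cluster-side settings on the NodeManagers rather than job properties.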

Hey Gian,

The number of running mappers seems capped at about 20. Interestingly, I set slowstart to 0.9, and once the reduce step started, the number of running mappers increased to 150, which I didn’t expect. I’ll read up on how to properly set the vcore and memory-mb limits in YARN - I was letting AWS supply the defaults, but perhaps there is some tuning to be done. Once again, I appreciate the help you have given so far.

Hey John,

Okay, I think I got myself confused too! I think mappers need to stay up until all the reducers get their data, so setting slowstart to 1.0 is probably not a great idea.

If you’re using EMR, then it should already have some reasonable defaults for your machines’ vcore and memory limits.

If your mappers are exiting way too quickly, it’s possible the files you’re reading are too small to be efficient and you actually want fewer mappers. Something else you could try is using Druid’s “combineText” setting, which switches to a combining InputFormat and makes Hadoop read multiple files into the same mapper.
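In the tuningConfig that’s just a boolean flag; a minimal sketch:

```json
{
  "tuningConfig": {
    "type": "hadoop",
    "combineText": true
  }
}
```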

I tried this out, but it actually combined everything into a single mapper, so I don’t think it’s what I want. I’m going to see if I can produce larger files (the ones we’re reading are under 64MB each, less than half of Hadoop 2’s default 128MB block size) and see if that makes for more efficient maps. I’ll let you know what ends up working best for me.