Low number of mappers in EMR with HadoopDruidIndexer

I’m using Druid 0.7.0 and trying to use the Hadoop batch ingestion, since the dataset is quite large (around 5.5B rows). I launched an EMR cluster (Hadoop 2.4.0 + AMI 3.5.0) and was able to get the indexer job running. The cluster has 20 c3.8xlarge nodes. The job reads 6816 files from S3, which corresponds to the total number of mappers. However, only around 30 mappers run concurrently at any one time, so the job isn’t really taking advantage of the cluster’s capacity. What’s interesting is that launching a 5-node cluster yields the same number of concurrent mappers.

I’m not sure whether this is something specific to the EMR configuration or to the HadoopDruidIndexer job, but based on the memory and vcore settings in yarn-site.xml and mapred-site.xml, I’d expect a maximum of 32 mappers running per node. Am I missing something?
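For reference, the per-node limits that the 32-mapper estimate comes from look roughly like this; the values below are illustrative, not the exact EMR defaults:

yarn-site.xml (per-node resources YARN can hand out):

  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>56320</value>   <!-- illustrative: ~55 of the c3.8xlarge's 60 GB left for containers -->
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>32</value>      <!-- c3.8xlarge has 32 vCPUs -->
  </property>

mapred-site.xml (per-mapper container request):

  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>1536</value>
  </property>
  <property>
    <name>mapreduce.map.cpu.vcores</name>
    <value>1</value>
  </property>

Concurrent mappers per node ≈ min(56320 / 1536, 32 / 1) = min(36, 32) = 32, so neither limit explains a cluster-wide ceiling of ~30.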

I don’t think there’s anything specific in Druid’s hadoop stuff that should cause that. Out of curiosity, on your hadoop cluster’s status page, what shows up under “Containers Running”, “Memory Used”, and “Memory Total”? Are you running any jobs other than the Druid indexer?

While the mappers are running, the number of running containers hovers around 45-55, Memory Used is around 350 GB, and Memory Total is around 1.2 TB. The used-capacity percentage on the status page confirms the cluster is not fully utilized. I’m not running anything else besides the Druid indexer. Once the mappers finish, the reducers run with much higher concurrency and end up using most of the cluster’s capacity, so it’s only the map phase that isn’t utilizing the cluster. The EMR cluster is configured as follows:

Master node: m3.2xlarge

Core nodes: 20 × c3.8xlarge

Task nodes: none

Hadoop: 2.4.0 (AMI 3.5.0)

Interestingly, when I launched 20 task nodes instead of core nodes (with just 5 core nodes), the cluster performed a bit better: I’m now getting around 140 mappers running concurrently (compared to ~30 before). I thought core and task nodes both run tasks as needed. Does anyone with EMR experience have an explanation for this?

I still feel the cluster can do better than 140 concurrent mappers though.

I’m not really sure why that’s happening. We don’t run EMR, but we do run on AWS (also with c3.8xlarges and a core/task split, but using the stock Apache Hadoop rather than EMR). Fwiw, we usually run our jobs with these options:

mapreduce.map.memory.mb = 2048
mapreduce.map.java.opts = -server -Xmx1536m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
mapreduce.reduce.memory.mb = 6144
mapreduce.reduce.java.opts = -server -Xmx2560m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

And so with that, a c3.8xlarge with its memory capacity set to 60 GB should be able to run 30 mappers (2 GB per map container), or 25 mappers with the capacity set to 50 GB. It shouldn’t matter whether it’s a core or a task node. We haven’t seen low-utilization issues like you’re describing. Sorry I can’t be more helpful…
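To spell that arithmetic out, the per-node mapper count is just the node’s YARN memory capacity divided by the map container size; the yarn-site.xml value below is illustrative:

  <!-- yarn-site.xml on each worker node (illustrative value) -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>61440</value>   <!-- 60 GB available for containers -->
  </property>

  <!-- With mapreduce.map.memory.mb = 2048 from the options above:   -->
  <!--   61440 MB / 2048 MB = 30 concurrent mappers per node         -->
  <!--   51200 MB (50 GB) / 2048 MB = 25 concurrent mappers per node -->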

One thing possibly worth looking at is whether all nodes are uniformly under-utilized, or whether some nodes are packed full while others sit empty. The latter might indicate a configuration problem with the cluster.