I’m using Druid 0.7.0 and trying to use Hadoop batch ingestion, since the dataset is quite large (around 5.5B rows). I launched an EMR cluster (Hadoop 2.4.0 + AMI 3.5.0) and was able to get the indexer job running. The cluster has 20 c3.8xlarge nodes. The number of files it’s reading from S3 is 6816, which matches the total number of mappers. However, only around 30 mappers run concurrently at any one time, so the job isn’t taking advantage of the cluster’s full capacity. What’s interesting is that launching a 5-node cluster yields the same number of concurrent mappers.
I’m not sure whether this is something specific to the EMR configuration or to the HadoopDruidIndexer job. But based on the memory and vcores settings in the yarn-site.xml and mapred-site.xml config files, I’d expect a maximum of 32 mappers per node. Am I missing something?
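For reference, this is the container math behind my 32-mappers-per-node expectation, as a rough sketch. The memory and vcore figures below are placeholders for a c3.8xlarge (32 vCPUs, 60 GiB RAM), not the actual EMR defaults, so substitute the real values from your cluster's yarn-site.xml and mapred-site.xml:

```python
# Back-of-the-envelope check of per-node mapper capacity under YARN.
# YARN grants each map task a container sized by mapreduce.map.memory.mb
# and mapreduce.map.cpu.vcores; a NodeManager can run as many containers
# as fit within both its memory and vcore budgets.

def max_concurrent_mappers(node_memory_mb, node_vcores,
                           map_memory_mb, map_vcores=1):
    """Return how many map containers fit on one node."""
    by_memory = node_memory_mb // map_memory_mb
    by_vcores = node_vcores // map_vcores
    return min(by_memory, by_vcores)

# Assumed (placeholder) settings:
#   yarn.nodemanager.resource.memory-mb  = 56320
#   yarn.nodemanager.resource.cpu-vcores = 32
#   mapreduce.map.memory.mb              = 1760
per_node = max_concurrent_mappers(56320, 32, 1760)
print(per_node)        # per-node ceiling
print(per_node * 20)   # cluster-wide ceiling for 20 nodes
```

With these assumed numbers, both the memory and vcore budgets cap out at 32 containers per node, i.e. 640 cluster-wide on 20 nodes, which is why seeing only ~30 mappers total is surprising.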