EMR Data Ingestion - cluster underutilized

Hey,

I'm trying to ingest a large number of files (about 22K) from S3 into Druid using an EMR MapReduce cluster.

The problem is that the cluster is not utilizing all of its resources. I have a cluster with ~500 vCores, but only ~10 are in use, and it never gets any closer to the ~500 - see the attached image.

When I tried running with a smaller cluster of ~80 vCores and ingesting fewer files (~400), I saw close to 60 vCores in use. There's definitely something wrong with my configuration.

Does anyone have an idea how this can be fixed?

This is the task I'm running:

{
  "type" : "index_hadoop",
  "spec" : {
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "granularity",
        "dataGranularity" : "day",
        "inputPath" : "s3://xxx/",
        "filePattern" : ".*\\.gz",
        "pathFormat" : "'dt'=yyyyMMdd/"
      }
    },
    "dataSchema" : {
      "dataSource" : "traceback",
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "hour",
        "queryGranularity" : "none",
        "intervals" : ["2017-12-20/2017-12-21"]
      },
      "parser" : {
        "type" : "hadoopyString",
        "parseSpec" : {
          "format" : "json",
          "dimensionsSpec" : {
            "dimensions" : [
              "uid",
              "ccr",
              "intg",
              "adType",
              "platform"
            ]
          },
          "timestampSpec" : {
            "format" : "auto",
            "column" : "timest"
          }
        }
      },
      "metricsSpec" : [
        {
          "name" : "count",
          "type" : "count"
        }
      ]
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 5000000
      },
      "jobProperties" : {
        "fs.s3.awsAccessKeyId" : "xxx",
        "fs.s3.awsSecretAccessKey" : "xxx",
        "fs.s3.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "fs.s3n.awsAccessKeyId" : "xxx",
        "fs.s3n.awsSecretAccessKey" : "xxx",
        "fs.s3n.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "io.compression.codecs" : "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec",
        "mapreduce.job.classloader" : "true",
        "mapreduce.job.classloader.system.classes" : "-javax.validation.,java.,javax.,org.apache.commons.logging.,org.apache.log4j.,org.apache.hadoop.,org.apache.http.,org.jets3t."
      }
    }
  }
}

I'm not that familiar with EMR or Hadoop in general, but you could try tweaking the "Resource Calculator", as mentioned here: https://stackoverflow.com/questions/29964792/apache-hadoop-yarn-underutilization-of-cores - YARN's CapacityScheduler defaults to the DefaultResourceCalculator, which only looks at memory when packing containers, so vCore usage can stay low even when plenty of CPU is free.
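For reference, a minimal sketch of what that switch might look like as an EMR cluster configuration (untested; it assumes the stock capacity-scheduler classification and swaps in the DominantResourceCalculator, which accounts for CPU as well as memory):

[
  {
    "Classification" : "capacity-scheduler",
    "Properties" : {
      "yarn.scheduler.capacity.resource-calculator" : "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
    }
  }
]

You'd pass that when creating the cluster; on a running cluster the equivalent is editing capacity-scheduler.xml on the master and refreshing the queues (yarn rmadmin -refreshQueues, if I remember correctly).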

Thanks Jonathan!

I thought Druid was supposed to tweak Hadoop for me ... I didn't know I needed to do any special configuration as well.

The problem is that both CPU and memory are underutilized, so I'm not sure this solution will help.

Thanks!

I'm having the same issue with my EMR cluster being underutilized. When I last tried, about six months ago, my cluster could reach ~100% utilization. Not sure what has changed since then.

So you're saying it's something to do with the newer versions of Druid?

It could be that, or AWS EMR may have added some restrictive default settings we don't know about?

Sorry I'm late to the party. We're experiencing low cluster utilization as well. We're running Druid 0.13.0 with Hadoop 2.8.5, and both memory and vCore usage sit below 25% of capacity.

Did you guys figure out what's causing it? I've tried everything that came to mind, but no results.

Thanks