Druid Hadoop EMR indexing is extremely slow

Hi Druid community,
I tried indexing from S3 with the indexing service, but it turned out to be slow, so, as recommended by Fangjin in https://groups.google.com/forum/#!topic/druid-user/lxQlvzTbrgk, I switched from the local indexing service to an EMR Hadoop cluster.
It took more than 5 minutes for the map/reduce job to index ~100 MB of data, which seems unreasonably slow. My setup is below and the EMR log is attached. I'm fairly sure this is not the expected performance; could you help check whether I missed anything obvious?
A rough performance estimate would also be appreciated (i.e., how long indexing this much data should be expected to take).
Thanks in advance
Shuai
Hardware:

  • 8 r3.4xlarge nodes
  • EMR Hadoop config is the default config

Data:

7 TSV files (~180 columns), each ~15 MB gzipped (~100,000 rows per file), > 100 MB per file uncompressed, ingested in one batch.

Hadoop spec:
{
  "spec" : {
    "dataSchema" : {
      "dataSource" : "stats_60_v1",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "tsv",
          "timestampSpec" : {
            "column" : "v_timestamp_end_utc",
            "format" : "yyyyMMddHHmmss",
            "missingValue" : null
          },
          "dimensionsSpec" : {
            "dimensions" : [ "col1"…"col180" ],
            "dimensionExclusions" : [ ],
            "spatialDimensions" : [ ]
          },
          "delimiter" : "\t",
          "listDelimiter" : null,
          "columns" : [ "col1"…"col180" ]
        }
      },
      "metricsSpec" : [ {
        "type" : "count",
        "name" : "count"
      } ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "HOUR",
        "queryGranularity" : {
          "type" : "none"
        },
        "intervals" : [ "2015-10-30T00:00:00.000Z/2015-11-01T00:00:00.000Z" ]
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "s3://bucket/prefix/file.gz,s3://bucket/prefix/file.gz,s3://bucket/prefix/file.gz"
      },
      "metadataUpdateSpec" : {
        "type" : "mysql",
        "connectURI" : "jdbc:mysql://IP:3306/druid",
        "user" : "druid",
        "password" : {
          "type" : "default",
          "password" : "diurd"
        },
        "segmentTable" : "druid_segments"
      },
      "segmentOutputPath" : "/tmp/segments"
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "workingPath" : "/tmp/working_path",
      "version" : "2015-11-19T10:26:47.797Z",
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 5000000,
        "maxPartitionSize" : 7500000,
        "assumeGrouped" : false,
        "numShards" : -1
      },
      "shardSpecs" : { },
      "indexSpec" : {
        "bitmap" : {
          "type" : "concise"
        },
        "dimensionCompression" : null,
        "metricCompression" : null
      },
      "leaveIntermediate" : false,
      "cleanupOnFailure" : true,
      "overwriteFiles" : false,
      "ignoreInvalidRows" : false,
      "jobProperties" : {
        "fs.s3.awsAccessKeyId" : "HIDE",
        "fs.s3.awsSecretAccessKey" : "HIDE",
        "fs.s3.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "fs.s3n.awsAccessKeyId" : "HIDE",
        "fs.s3n.awsSecretAccessKey" : "HIDE",
        "fs.s3n.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "io.compression.codecs" : "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec"
      },
      "combineText" : false,
      "persistInHeap" : false,
      "ingestOffheap" : false,
      "bufferSize" : 134217728,
      "aggregationBufferRatio" : 0.5,
      "useCombiner" : false,
      "rowFlushBoundary" : 80000
    }
  }
}
I run the following command on the EMR master node:

java -Xmx40g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/:/test/druid/poc/prod/config/_common/ io.druid.cli.Main index hadoop /ebs-stats/druid/poc/indexing/hadoop_sepc_1.json

emr.log (7.7 KB)

Are any of those dimensions high cardinality dimensions?

There are multiple columns with high cardinality. What would that imply?

In that case, https://github.com/druid-io/druid/pull/1960 will probably help a lot.

High-cardinality columns are known to be a bit of a sore point when creating an index. There hasn't been much effort put into optimizing them, but I bet that if you looked at the reducer logs you'd see a lot of time spent indexing the high-cardinality columns.

Hey Shuai,

The patch Charles linked should help a lot if you have high cardinality columns.

Other than that, from your EMR log it looks like the index reducer (which creates the Druid segments) is what’s taking most of the time. Two other things you can try:

  1. Have fewer rows per segment, which you can do by lowering the "targetPartitionSize" in your config. This is the target number of rows per segment. Your data is somewhat wide (180 columns), so you may benefit from lowering that row target. One reducer generates one segment, so this will improve your M/R parallelism.

  2. Have fewer intermediate persists, which you can do by raising the "rowFlushBoundary" in your config. This is the maximum number of rows that will be buffered in-heap before being flushed to disk. Watch for running out of heap, though. (A sketch combining both adjustments follows below.)
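For reference, here is a minimal sketch of how both adjustments might look in the tuningConfig from the spec above. The specific numbers are illustrative assumptions, not values recommended in this thread; they only show the direction of the change (lower "targetPartitionSize", higher "rowFlushBoundary"):

    "tuningConfig" : {
      "type" : "hadoop",
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 1000000,
        "maxPartitionSize" : 1500000
      },
      "rowFlushBoundary" : 500000
    }

A lower targetPartitionSize produces more, smaller segments, which means more index reducers running in parallel; a higher rowFlushBoundary means fewer intermediate persists per reducer, at the cost of more heap usage.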

Thanks Gian for the suggestions. 1 and 2 are definitely helping; I see a 2-3x performance improvement. However, it is still not within an acceptable performance range.

I tried Druid 0.8.2, which I think already includes the patch Charles mentioned, but that did not seem to help.

Another thought: could the string comparisons while indexing the data be the main source of slowness? All my data is in TSV. I'm also trying real-time indexing via Tranquility, where I translate each TSV row to JSON (key to object); I'm not sure whether the explicit type spec will help at all.

Inline.

Thanks Gian for the suggestions. 1 and 2 are definitely helping; I see a 2-3x performance improvement. However, it is still not within an acceptable performance range.

What is the acceptable performance range and why not use real-time ingestion if performance is critical?

I tried Druid 0.8.2, which I think already includes the patch Charles mentioned, but that did not seem to help.

It is not included.

Another thought: could the string comparisons while indexing the data be the main source of slowness? All my data is in TSV. I'm also trying real-time indexing via Tranquility, where I translate each TSV row to JSON (key to object); I'm not sure whether the explicit type spec will help at all.

Why not use TSV directly? Druid supports it.

Probably because Tranquility does not support TSV, only JSON.
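For context, since Tranquility only accepts keyed objects (JSON) rather than delimited text, each TSV row would have to be converted field by field before being sent. A hypothetical single-row example, using the timestamp column and format from the spec above and made-up values for a couple of the dimension columns (the remaining columns up to col180 would follow the same key-to-value pattern):

    {
      "v_timestamp_end_utc" : "20151030120000",
      "col1" : "valueA",
      "col2" : "valueB"
    }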