Very slow batch ingestion (Druid 0.8.3)

Hi, community.

I am seeing very slow batch ingestion. My Druid version is 0.8.3. I'm importing 700M records into Druid using a Hadoop cluster in AWS; the cluster consists of 5 worker nodes (r3.8xlarge, 244 GB of memory each).

It takes more than 24 hours to import this data (an average rate of about 6,720 rows/sec). As far as I understand, during ingestion Druid tries to pre-aggregate (roll up) my data, but since my data is already pre-aggregated I don't need this step. There is an "assumeGrouped" property in "partitionsSpec", but in the documentation it is described only under "Single dimension partitioning" (http://druid.io/docs/0.8.3/ingestion/batch-ingestion.html):

"partitionsSpec" : {
  "type" : "dimension",
  "partitionDimension" : null,
  "targetPartitionSize" : 5000000,
  "maxPartitionSize" : 7500000,
  "assumeGrouped" : false,
  "numShards" : -1
}

But I'm using hash-based partitioning. Here is my "tuningConfig" section:

"tuningConfig" : {
  "type" : "hadoop",
  "rowFlushBoundary" : 200000,
  "maxRowsInMemory" : 200000,
  "partitionsSpec" : {
    "type" : "hashed",
    "targetPartitionSize" : 1000000,
    "assumeGrouped" : true
  }
}

My questions are:

  1. Does the "assumeGrouped" property work for hash-based partitioning?
  2. If not, how can I turn off roll-up and improve ingestion performance?

Hey Ilia,

The hash-based partitioning scheme does not have a pre-grouping step, so assumeGrouped has no effect on it. Druid indexing with a hashed partitionsSpec is usually a two-job process (one job to determine the partitioning and one to do the actual indexing). Could you look into which job is slow, and whether the mappers or the reducers seem to be the bottleneck?

Some other things to look into:

  • If you aren't getting enough index-generator reducers (on the second job), you can adjust segmentGranularity or targetPartitionSize so that you produce more segments, which will give you more reducers. Don't go too extreme with this, though, since generating a lot of tiny segments can hurt query performance.

  • If your index-generator reducers are getting many more input rows than your configured targetPartitionSize, you can cut that down by setting "useCombiner": true in the tuningConfig to enable the Hadoop combiner.

  • If you know how many shards you'll need, using numShards instead of targetPartitionSize lets the indexer skip the first Hadoop job entirely (see the sketch below).
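
For illustration, here is a rough sketch of how those knobs could be combined, assuming the 0.8.3 hadoop-index spec format; the numShards value, segment granularity, and interval are placeholders to tune for your data, not recommendations:

"tuningConfig" : {
  "type" : "hadoop",
  "useCombiner" : true,
  "partitionsSpec" : {
    "type" : "hashed",
    "numShards" : 20
  }
}

and, inside the "dataSchema" section, a finer segment granularity if you want more index-generator reducers:

"granularitySpec" : {
  "type" : "uniform",
  "segmentGranularity" : "DAY",
  "queryGranularity" : "NONE",
  "intervals" : [ "<your-interval>" ]
}

With numShards set, the determine-partitions job is skipped and the index-generator job starts right away; the combiner cuts down the rows shuffled to each reducer, and a finer segmentGranularity yields more segments and therefore more reducers.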