Remote hadoop indexing - How to make it run faster (reducer parallelism?)

Hi there,
I have been running my indexing jobs on a remote Hadoop cluster, with the metadata updated in the Druid cluster once the job is complete. Currently the input data is gzip-compressed text.

Raw size = 30 GB (100 M rows)

Compressed (gzip) = 6.2 GB

Number of part files = 100 (about 64 MB each, gzip-compressed)

So every day’s run uses 100 mappers (as expected, given that my split size and block size are almost the same).

The mapper phase of the indexer runs pretty fast, within a few minutes (say 5-6 minutes). But the indexer sets the number of reducers to just “1”, and that single reducer takes a long time: about 2 hours in total to complete and upload the data to S3/HDFS.

So my question is:

  • How do we increase the reducer parallelism?

  • Which parameters in the input JSON affect the number of reducers? (I have attached the parameters I tried below, but none of them seem to change anything.)

  • How is the number of output segments decided? Is it just numShards or targetPartitionSize (under partitionsSpec), or something else?

  • Is there a direct correlation between the number of segments and the number of reducers?
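To make the segment-count question concrete, here is a rough sketch of how I currently assume the shard count is derived from the partitioning settings. The function name `estimated_shards` and the ceiling-division rule are my own assumptions, not taken from the Druid source; the indexer may well work from an estimated cardinality rather than the raw row count, which is part of what I am asking.

```python
import math


def estimated_shards(rows_per_interval, target_partition_size=None, num_shards=None):
    """Guess the number of shards (segments) for one time interval.

    Assumption (unconfirmed): an explicit numShards is used as-is;
    otherwise the count is ceil(rows / targetPartitionSize).
    """
    if num_shards is not None:
        return num_shards
    if target_partition_size is None or target_partition_size <= 0:
        raise ValueError("need a positive target_partition_size or a num_shards")
    return max(1, math.ceil(rows_per_interval / target_partition_size))


# Figures from this post: 100 M rows in one day's interval.
rows = 100_000_000
print(estimated_shards(rows, num_shards=10))
print(estimated_shards(rows, target_partition_size=10_000_000))
```

If that assumption holds, both settings I tried should give about 10 segments for one day, and if reducers map one-to-one to segments I would have expected about 10 reducers rather than 1.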

Currently it takes 2.5 hours just to index one day’s worth of data, which is not acceptable given that our whole day’s processing takes much less time.

So what is the limiting factor on the reducer? Why is it running this slow? What are the tunable options available to users?

Try1:

"tuningConfig" : {
  "type" : "hadoop",
  "buildV9Directly": "true",
  "maxRowsInMemory": 1000000,
  "workingPath": "/tmp/druid-indexing"
},
"partitionsSpec" : {
  "type" : "hashed",
  "numShards": 10
},

Try2:

"tuningConfig" : {
  "type" : "hadoop",
  "buildV9Directly": "true",
  "maxRowsInMemory": 1000000,
  "numBackgroundPersistThreads": 1
},
"partitionsSpec" : {
  "type" : "hashed",
  "targetPartitionSize" : 10000000
},
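One variation I have not tried yet: the Druid batch ingestion docs show partitionsSpec nested inside tuningConfig, whereas in both attempts above it sits next to tuningConfig as a sibling. Below is a minimal sketch of generating that nested layout with the Try1 values. The helper name `build_tuning_config` is hypothetical, and I have not verified that this changes the reducer count.

```python
import json


def build_tuning_config(num_shards):
    """Return a Hadoop tuningConfig with partitionsSpec nested inside it."""
    return {
        "type": "hadoop",
        "buildV9Directly": "true",  # kept as a string, exactly as in Try1
        "maxRowsInMemory": 1000000,
        "workingPath": "/tmp/druid-indexing",
        # Nested here instead of being a sibling of tuningConfig.
        "partitionsSpec": {
            "type": "hashed",
            "numShards": num_shards,
        },
    }


spec_fragment = {"tuningConfig": build_tuning_config(num_shards=10)}
print(json.dumps(spec_fragment, indent=2))
```

The printed fragment would replace the tuningConfig and partitionsSpec entries in the input JSON.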

Any advice is greatly appreciated.

Thanks,

Uday.

Any help is greatly appreciated :)

~Uday