Speeding up Hadoop Index Task

I am back-filling large periods of data from HDFS, but it's proving slow (10+ hours per month of data), and I am looking to speed up Hadoop ingestion.

My data has 35 dimensions and 36 metrics, and I need it aggregated at the hourly level. It is stored as CSVs on HDFS and amounts to ~40 GB per day.

Sample Row:

2017-03-01T07:00:00.000Z,US:OK,157876,0,7,604094,…

Here are the granularity and tuning portions from my ingestion spec.

"granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "hour",
    "queryGranularity": "none",
    "rollup": true,
    "intervals": ["2016-01-01/P1M"]
},
"tuningConfig": {
    "type": "hadoop",
    "targetPartitionSize": 5000000,
    "rowFlushBoundary": 75000,
    "numShards": -1,
    "indexSpec": {
        "bitmap": {
            "type": "concise"
        },
        "dimensionCompression": "lz4",
        "metricCompression": "lz4",
        "longEncoding": "longs"
    },
    "buildV9Directly": false,
    "forceExtendableShardSpecs": true
}

I am running each Hadoop ingestion task on a month's worth of data at once, and I have confirmed that the MapReduce job is not running locally.

2017-05-03T17:29:55,357 WARN [task-runner-0-priority-0] org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2017-05-03T17:29:55,531 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 744
2017-05-03T17:29:56,187 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.JobSubmitter - number of splits:6913
2017-05-03T17:29:56,256 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1492526761603_0101

Is there anything I can change in the ingestion spec or in my process (e.g. loading a day's worth at a time instead of a month's) to speed up ingestion?

On the Druid side, one thing you can do is set buildV9Directly to true, or upgrade to Druid 0.10.0, where it's true by default. Raising rowFlushBoundary could help too, although not too high, since you don't want to run out of memory. Other than that, the best thing to do is follow 'standard' Hadoop performance tuning steps: turn on combining if you have a lot of small files, potentially adjust mapper/reducer container sizing, that kind of thing.
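Putting those suggestions together, the tuningConfig might end up looking roughly like the sketch below. The rowFlushBoundary value, the useCombiner flag, and the container/heap sizes under jobProperties are illustrative assumptions to adapt to your own cluster and row width, not values recommended in this thread.

"tuningConfig": {
    "type": "hadoop",
    "targetPartitionSize": 5000000,
    "rowFlushBoundary": 150000,
    "numShards": -1,
    "buildV9Directly": true,
    "useCombiner": true,
    "forceExtendableShardSpecs": true,
    "indexSpec": {
        "bitmap": {
            "type": "concise"
        },
        "dimensionCompression": "lz4",
        "metricCompression": "lz4",
        "longEncoding": "longs"
    },
    "jobProperties": {
        "mapreduce.map.memory.mb": "4096",
        "mapreduce.map.java.opts": "-Xmx3072m",
        "mapreduce.reduce.memory.mb": "8192",
        "mapreduce.reduce.java.opts": "-Xmx6144m"
    }
}

useCombiner merges rows that share a key at the mapper, so it mainly pays off when rollup collapses many input rows. For the many-small-files case, the static inputSpec in the ioConfig also has a combineText option that switches to combined text splits; whether it is available and helps depends on your Druid version and file layout.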

Thanks, Gian. Your suggestions greatly sped up the ingestion task.