Hadoop batch with 0.10.1

Has anyone been able to successfully do a Hadoop-based batch ingestion with 0.10.1 and newer releases?

In the logs we see that with 0.9.1, the loadSpec was:

loadSpec:

{
    "type": "s3_zip",
    "bucket": "s3bucket",
    "key": "/druid/ … /index.zip"
}

And with 0.10.1, the loadSpec is:

loadSpec:

{
    "type": "local",
    "path": "/druid/ … /index.zip"
}

Because of this incorrect loadSpec, the historical nodes are unable to download any segments.

Loading data using the spec below worked fine on 0.9.1, but has been broken since we upgraded our environment to use 0.10.1 for batch ingestion.

"ioConfig": {

    "inputSpec": {

        "table_path": "path",

        "paths": "s3n://inputpath",

        "type": "static"

    },

    "metadataUpdateSpec": {

        "connectURI": "jdbc:mysql://[druid.db.prod.abc.net:3306/druid](http://druid.db.prod.abc.net:3306/druid)",

        "segmentTable": "druid_test_segments",

        "type": "mysql",

        "user": "druid"

    },

    "segmentOutputPath": "s3n://segment_output_path/bdp_druid_test",

    "type": "hadoop"

},

"tuningConfig": {

    "jobProperties": {

        "mapreduce.map.speculative": false,

        "mapreduce.output.fileoutputformat.compress": false,

        "mapreduce.reduce.memory.mb": 6144,

        "mapreduce.reduce.speculative": false,

        "mapreduce.user.classpath.first": true,

    },

    "leaveIntermediate": false,

    "partitionsSpec": {

        "numShards": 200,

        "partitionDimensions": [

            "alloc_id"

        ],

        "type": "hashed"

    },

    "type": "hadoop",

    "workingPath": "/tmp/druid/c1714a73-fe87-11e7-8c2c-784f4390605c"

}

Hi, Samarth, we’ve been successfully running Hadoop batch ingestion with Druid 0.10.1 for quite some time. The trick to getting it working is to use the correct version of the Hadoop client libraries. Druid 0.10.1 was built against Hadoop 2.7.3, so you need to make sure that you have the 2.7.3 libraries in the hadoop-dependencies directory of your Druid installation. Even if your YARN cluster is not 2.7, using the 2.7.3 client may still work. For example, our YARN cluster is 2.6 (CDH 5.8.3) and the 2.7.3 client libraries are backwards compatible. If your YARN cluster version is much older, try matching Druid’s hadoop client library to your cluster’s version as detailed here: http://druid.io/docs/latest/operations/other-hadoop.html. Also, be sure to set the hadoopDependencyCoordinates property in your indexing task:

"hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.7.3"],
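
For reference, here is roughly where that property sits, at the top level of an index_hadoop task spec. This is only a sketch; everything inside "spec" is elided and the values come from our setup, not yours:

{
    "type": "index_hadoop",
    "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.7.3"],
    "spec": {
        "dataSchema": { ... },
        "ioConfig": { ... },
        "tuningConfig": { ... }
    }
}

Druid resolves that coordinate against the hadoop-dependencies directory of the installation, so the jars need to live under hadoop-dependencies/hadoop-client/2.7.3/ for the version string to match.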

Thanks for the reply, TJ. Where is your input data stored? And where are you writing the segment files to? In our case, both locations point to an S3 bucket.
Are you configuring druid.storage.type somewhere? Or are you supplying it as a param to your map-reduce job?

We use HDFS to store both the input data for reingestion as well as the output segments. We set druid.storage.type to “hdfs” in the common properties file used by all Druid services (_common/common.runtime.properties) as well as directly in the Middle Manager configs so that the peons also use HDFS: druid.indexer.fork.property.druid.storage.type=hdfs.
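
As a rough sketch, the relevant lines look like this (the storage directory below is a placeholder, not our real path, and you also need the druid-hdfs-storage extension in your druid.extensions.loadList):

# _common/common.runtime.properties
druid.storage.type=hdfs
druid.storage.storageDirectory=/druid/segments

# middleManager/runtime.properties
druid.indexer.fork.property.druid.storage.type=hdfs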

Hi Samarth,

You’ll need to set druid.storage.type=hdfs in your common runtime properties in 0.10.1 druid or later.

In previous versions, the loadSpec was determined from the URL, but now this decision is made using the configured druid.storage.type property.

Or in this case, since you're using S3, set druid.storage.type=s3.
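
As a rough sketch (the bucket, base key, and credentials below are placeholders, and the loadList should also include whatever other extensions you already use), the S3 setup in _common/common.runtime.properties would look something like:

druid.extensions.loadList=["druid-s3-extensions"]
druid.storage.type=s3
druid.storage.bucket=s3bucket
druid.storage.baseKey=druid/segments
druid.s3.accessKey=<your access key>
druid.s3.secretKey=<your secret key>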

Thanks for replying, Jonathan and TJ. I will make the necessary changes on my side.

The behavior change tripped us up because we had never included the common.runtime.properties file in the classpath of the HadoopDruidIndexerJob before. The release notes for 0.10.1 (https://github.com/druid-io/druid/issues/4384) didn't mention this new requirement either.