On Hadoop Indexing

Hello Druid gurus, I was digging through the Druid documentation and I came across the following instruction on the Tutorial: Loading Batch Data page:

```
java -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath config/_common:config/overlord:lib/*:<hadoop_config_path> io.druid.cli.Main server overlord
```

Is the <hadoop_config_path> a typo? If not, what does it mean? And how is it different from the following command mentioned on the Batch Data Ingestion page:

```
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:<hadoop_config_path> io.druid.cli.Main index hadoop <spec_file>
```

Thanks,

Tim

try $(hadoop classpath)
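For example, plugging that into the second command you quoted would look something like this (just a sketch, assuming the hadoop CLI is on your PATH):

```
# <hadoop_config_path> can be filled in with the output of `hadoop classpath`
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
  -classpath "lib/*:$(hadoop classpath)" \
  io.druid.cli.Main index hadoop <spec_file>
```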

Hi Charles, thank you. So, in what scenarios should I include the <hadoop_config_path>?

Thanks,

Tim

Here’s an example of what that command yields on my system:

```
echo $(hadoop classpath)
/usr/local/Cellar/hadoop/2.7.0/libexec/etc/hadoop:/usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/common/lib/*:/usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/common/*:/usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/hdfs:/usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/hdfs/lib/*:/usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/hdfs/*:/usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/yarn/lib/*:/usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/yarn/*:/usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/lib/*:/usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar
```

Note that the first part of the classpath is /usr/local/Cellar/hadoop/2.7.0/libexec/etc/hadoop

This is the path to the hadoop config. The rest is just libraries that hadoop needs to do its thing.
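If you only want the config directory itself (the <hadoop_config_path> part), you could grab the first colon-separated entry, roughly like this (a sketch):

```
# The first entry of `hadoop classpath` is the Hadoop config directory;
# everything after it is library directories.
HADOOP_CONF_DIR=$(hadoop classpath | cut -d: -f1)
echo "$HADOOP_CONF_DIR"   # e.g. /usr/local/Cellar/hadoop/2.7.0/libexec/etc/hadoop
```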

Yeah, but I guess what I am trying to ask is, why would I need to include Hadoop when I am running an indexing task against the Indexing Service (Overlord), as shown in the command below?

```
java -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/_common:config/overlord:$(hadoop classpath) io.druid.cli.Main server overlord
```

Is that necessary?

Ah, I see now. I had in my head you were running the hadoop task straight out.

I know FJ has been tinkering with hadoop config stuff a lot recently so I’ll have to defer to him on exactly where it minimally must be included.

In general, though, I’m assuming you’ll want to at least be familiar with http://druid.io/docs/0.7.3/Other-Hadoop.html

Thanks, Charles. I looked at the doc you shared, and it mentions that I can use a Hadoop that is different from the OOTB one.

So it looks like, if I want to use a different Hadoop deployment for the “Hadoop Index Task”, I will need to include it in the classpath like so:

```
java -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/_common:config/overlord:$(hadoop classpath) io.druid.cli.Main server overlord
```

Plus, of course, I must specify in the index task JSON that the task type is “index_hadoop” in order to trigger the “Hadoop Index Task”.
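Something like this, if I'm reading the docs right (just a skeleton; the inner spec is the usual dataSchema/ioConfig/tuningConfig, and I'm not certain the wrapper field is actually called “spec” in 0.7.x):

```
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": { "...": "..." },
    "ioConfig": { "type": "hadoop", "...": "..." },
    "tuningConfig": { "type": "hadoop", "...": "..." }
  }
}
```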

As for the following command, it invokes the standalone HadoopDruidIndexer rather than going through the indexing service:

```
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:<hadoop_config_path> io.druid.cli.Main index hadoop <spec_file>
```

I suppose the only time I would use that command is when I have a separate Hadoop deployment, or can I use it with the OOTB one as well?

Hadoop-based batch data ingestion in Druid requires a remote Hadoop cluster to be present. The documentation provides some guidelines around rejiggering the classpath to support various versions of Hadoop. The difference between the Hadoop index task and the command-line indexer is really that one runs in the context of the indexing service.
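Concretely, with the indexing service you POST the index_hadoop task JSON to the overlord instead of invoking the CLI directly, something like this (assuming the default overlord port 8090; the task file name here is just a placeholder):

```
# Submit a Hadoop index task to the overlord (indexing service)
curl -X POST -H 'Content-Type: application/json' \
  -d @my_hadoop_task.json \
  http://localhost:8090/druid/indexer/v1/task
```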

Thanks, Fangjin, I figured the same. In fact, I ran the following command:

```
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/_common io.druid.cli.Main index hadoop examples/indexing/wikipedia_hadoop_config.json
```

With the following wikipedia_hadoop_config.json:

```
{
  "dataSchema": {
    "dataSource": "wikipedia",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "timestamp",
          "format": "auto"
        },
        "dimensionsSpec": {
          "dimensions": [
            "page",
            "language",
            "user",
            "unpatrolled",
            "newPage",
            "robot",
            "anonymous",
            "namespace",
            "continent",
            "country",
            "region",
            "city"
          ],
          "dimensionExclusions": [],
          "spatialDimensions": []
        }
      }
    },
    "metricsSpec": [
      {
        "type": "count",
        "name": "count"
      },
      {
        "type": "doubleSum",
        "name": "added",
        "fieldName": "added"
      },
      {
        "type": "doubleSum",
        "name": "deleted",
        "fieldName": "deleted"
      },
      {
        "type": "doubleSum",
        "name": "delta",
        "fieldName": "delta"
      }
    ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "DAY",
      "queryGranularity": "NONE",
      "intervals": ["2013-08-31/2013-09-01"]
    }
  },
  "ioConfig": {
    "type": "hadoop",
    "inputSpec": {
      "type": "static",
      "paths": "examples/indexing/wikipedia_data.json"
    },
    "metadataUpdateSpec": {
      "type": "mysql",
      "connectURI": "jdbc:mysql://localhost:3306/druid",
      "user": "druid",
      "password": "diurd",
      "segmentTable": "druid_segments"
    },
    "segmentOutputPath": "/tmp/segments"
  },
  "tuningConfig": {
    "type": "hadoop",
    "workingPath": "/tmp/working_path",
    "partitionsSpec": {
      "targetPartitionSize": 5000000
    }
  }
}
```

And it worked! I can see segments created in the druid_segments table, and I am also able to query the data. Does this mean that I was able to trigger the HadoopDruidIndexer against the OOTB Hadoop that ships with Druid?
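For anyone following along, one quick way to check is to query the segment table defined in the metadataUpdateSpec above, e.g.:

```
# Connection details come from the metadataUpdateSpec in the config above
mysql -u druid -pdiurd druid \
  -e "SELECT * FROM druid_segments WHERE dataSource = 'wikipedia';"
```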

Hi Tim, yes, if the segment was created things should have worked correctly.