Druid 0.8.3 - HDFS druid.storage.storageDirectory value?

What value should I specify for druid.storage.storageDirectory when I have multiple Hadoop nodes? Should I be load balancing my nodes under different cluster names?

druid.storage.storageDirectory=hdfs://<namenode>:8020/path/druid-hdfs-storage


My current Hadoop cluster consists of the following nodes.

hadoopm01 - Hadoop Master Node 1
hadoopm02 - Hadoop Master Node 2
hadoops01 - Hadoop Slave Node 1
hadoops02 - Hadoop Slave Node 2
hadoops03 - Hadoop Slave Node 3

Just to follow up on my original post:
Am I correct in thinking that I want my HDFS URI (druid.storage.storageDirectory) to point to a NameNode?
Currently, my Hadoop Master Nodes run the following services: HDFS NameNode, YARN ResourceManager, and HBase Master, and my Hadoop Slave Nodes run these services: HDFS DataNode, YARN NodeManager, and HBase RegionServer.
Which of the following are valid options, and which is the better option?
Option 1:

druid.storage.storageDirectory=hdfs://hadoopm01:8020/path/druid-hdfs-storage


Option 2:

druid.storage.storageDirectory=hdfs://hadoopm01:8020;hadoopm02:8020/path/druid-hdfs-storage


Option 3:

druid.storage.storageDirectory=hdfs://hadoopm:8020/path/druid-hdfs-storage


  • Where the Hadoop Master Node cluster name is “hadoopm” (hadoopm01, hadoopm02)

Option 4:

druid.storage.storageDirectory=hdfs://hadoopnn:8020/path/druid-hdfs-storage

  • Where the Hadoop NameNode cluster name is “hadoopnn” (hadoopm01, hadoopm02)

I seem to have been able to use Option 2 successfully.

druid.storage.storageDirectory=hdfs://hadoopm01:8020;hadoopm02:8020/path/druid-hdfs-storage


I was surprised that I could specify multiple HDFS servers like this, as it is not mentioned in the documentation: http://druid.io/docs/latest/dependencies/deep-storage.html

It would be great if someone could outline their experience with connecting Druid to an HDFS cluster. Any comments appreciated.

Hi,

Druid uses the hdfs-client jar provided by Hadoop to access HDFS, so it supports multiple NameNodes in the path. Also, you can choose (and I prefer to do so) not to specify NameNode information at all in the storageDirectory property, but instead keep those details in the Hadoop configuration files and put them on the classpath. The Hadoop client jar can read the NameNode information from the Hadoop configuration files.

– Himanshu
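
For illustration, here is a minimal sketch of that approach (the /etc/hadoop/conf path and the property values are assumptions, not taken from this thread):

#!/bin/sh
# Sketch only: keep NameNode details out of Druid and let the Hadoop client
# read them from the core-site.xml / hdfs-site.xml found on the classpath.
HADOOP_CONF_DIR=/etc/hadoop/conf   # assumed location of core-site.xml and hdfs-site.xml

# config/_common/common.runtime.properties would then only need, for example:
#   druid.storage.type=hdfs
#   druid.storage.storageDirectory=/path/druid-hdfs-storage   # fs.defaultFS supplies the scheme and host

exec /usr/bin/java -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
  -classpath "config/_common:config/historical:lib/*:${HADOOP_CONF_DIR}" \
  io.druid.cli.Main server historical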

Hi Himanshu. How do you add your Hadoop classpath to the Druid node execution path?
I am currently using Apache Ambari to manage my Hadoop cluster; is the following in line with your strategy?
The launch command for my Druid Historical Node:

/usr/bin/java -Xms4g -Xmx4g
-XX:NewSize=2g -XX:MaxNewSize=2g -XX:MaxDirectMemorySize=8g -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/opt/druid/druid-tmp -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
-classpath config/_common:config/historical:lib/*:$(/usr/hdp/current/hadoop-client/bin/hadoop classpath) io.druid.cli.Main server historical


The running process for my Druid Historical Node:

/usr/bin/java -Xms4g -Xmx4g -XX:NewSize=2g -XX:MaxNewSize=2g -XX:MaxDirectMemorySize=8g -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCps -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/opt/druid-tmp -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -classpath config/_common:config/historica/usr/hdp/2.3.2.0-2950/hadoop/conf:/usr/hdp/2.3.2.0-2950/hadoop/lib/:/usr/hdp/2.3.2.0-2950/hadoop/.//:/usr/hdp/2.3.2.0-2950/hadoop-hdfs/./:/usr/hdp/2.3.2.0-2950/hadoop-hdfs/lib/:/usr/hdp/2.30/hadoop-hdfs/.//:/usr/hdp/2.3.2.0-2950/hadoop-yarn/lib/:/usr/hdp/2.3.2.0-2950/hadoop-yarn/.//:/usr/hdp/2.3.2.0-2950/hadoop-mapreduce/lib/:/usr/hdp/2.3.2.0-2950/hadoop-mapreduce/.//:::/us3.2.0-2950/tez/:/usr/hdp/2.3.2.0-2950/tez/lib/:/usr/hdp/2.3.2.0-2950/tez/conf io.druid.cli.Main server historical


Hey Mark,

Something looks messed up with the classpath you linked, I’m not sure if this is a copy/paste error or an actual problem. But one of your paths is “config/historica/usr/hdp/2.3.2.0-2950/hadoop/conf” so it looks like something got chomped.

At any rate, in general things should work if you include a directory containing your hadoop XMLs on your classpath. I think hdfs-site.xml is the important one, but having them all is good too.
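
A quick way to sanity-check that (the HDP conf path below is an assumption based on a standard Ambari install, where it normally symlinks to /etc/hadoop/conf):

# List the Hadoop XMLs in the conf directory you intend to put on the Druid classpath.
ls -l /usr/hdp/current/hadoop-client/conf/core-site.xml /usr/hdp/current/hadoop-client/conf/hdfs-site.xml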

Hi Gian,

I don’t really know what happened with that previous running-process execution string.

Anyway, my current classpath is described in this other thread https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/druid-user/jR6-KFvDceE/u6j-4sbTCAAJ , and it seems to be working.

/usr/bin/java -Xms4g -Xmx4g -XX:NewSize=2g -XX:MaxNewSize=2g -XX:MaxDirectMemorySize=8g -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/opt/druid/druid-tmp -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -classpath config/_common:config/historical:lib/*:/opt/druid/extensions-repo/org/apache/hadoop/hadoop-client/2.7.1/hadoop-client-2.7.1.jar:/opt/druid/extensions-repo/org/apache/hadoop/hadoop-hdfs/2.7.1/hadoop-hdfs-2.7.1.jar:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/* io.druid.cli.Main server historical


Also, I actually had to keep Hadoop’s configuration files out of my Druid classpath, as including them created a whole bunch of exceptions when I tried to start Druid. Maybe I got these issues because I am using an Apache Ambari-configured Hadoop distribution.

Hey Mark, okay, great to hear that things are working!

There was another option that I decided to go with for now.

Option 5:

As I am running my Hadoop cluster with High Availability (NameNode HA configured using Ambari), my storage directory value is based on the Nameservice ID value “hadoopc”.

druid.storage.storageDirectory=hdfs://hadoopc:8020/path/druid-hdfs-storage


Nameservice ID Configuration of NameNode HA using Apache Ambari:
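
For anyone following along, a sketch of how to confirm the Nameservice ID that the HDFS client actually sees (these are the standard HA property names; the expected values are only what this particular setup implies):

# hdfs getconf reads the same core-site.xml / hdfs-site.xml that the Druid nodes need on their classpath.
hdfs getconf -confKey dfs.nameservices           # expected here: hadoopc
hdfs getconf -confKey dfs.ha.namenodes.hadoopc   # expected: the two NameNode IDs, e.g. nn1,nn2
hdfs getconf -confKey fs.defaultFS               # expected here: hdfs://hadoopc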

Hi Himanshu,

How do you add the Hadoop configuration files to the Druid classpath? Do you just add the folder containing your configuration files (core-site.xml, hdfs-site.xml) to the Druid classpath? Do you have an example?

Create folder for xml files. Put xml files in folder. Put “:folder_name/*:” in java cp.
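
A minimal sketch of that suggestion (the folder name and the /etc/hadoop/conf source path are only examples):

#!/bin/sh
# Create a folder for the Hadoop XMLs and copy them in.
mkdir -p /opt/myapp/druid-0.8.3/config/_common-classpath
cp /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml \
   /opt/myapp/druid-0.8.3/config/_common-classpath/
# Then include the folder in the java -classpath when starting each Druid node.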

I have tended to add my configuration files (core-site.xml, hdfs-site.xml) in the form “:folder_name/” without the wildcard (":/opt/myapp/druid-0.8.3/config/_common-classpath:"). Is this a mistake? Should I be adding the full .xml file path?

Below is a rough sample of my Druid Overlord classpath:

druid-overlord-start.sh

#!/bin/sh

/usr/bin/java -Xms2g -Xmx2g \
-XX:NewSize=256m -XX:MaxNewSize=256m -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
$(/opt/myapp/druid-0.8.3/_druid-arguments.sh) \
-classpath config/_common:config/overlord:lib/*:$(/opt/myapp/druid-0.8.3/_druid-classpath.sh) io.druid.cli.Main server overlord

_druid-arguments.sh

#!/bin/sh

echo " \
-Duser.timezone=UTC \
-Dfile.encoding=UTF-8 \
-Djava.io.tmpdir=/opt/myapp/druid-tmp \
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager \
"

_druid-classpath.sh

#!/bin/sh

echo "\
/opt/myapp/druid-0.8.3/config/_common-classpath\
:/opt/myapp/druid-0.8.3/extensions-repo/org/apache/hadoop/hadoop-client/2.7.1/hadoop-client-2.7.1.jar\
:/opt/myapp/druid-0.8.3/extensions-repo/org/apache/hadoop/hadoop-hdfs/2.7.1/hadoop-hdfs-2.7.1.jar\
:/usr/hdp/current/hadoop-client/*\
:/usr/hdp/current/hadoop-client/lib/*\
"

Also, someone just mentioned in another thread that I might be able to drop my core-site.xml and hdfs-site.xml files into my Druid config/_common directory. Does this sound like it could work? Are there any issues with this approach?

Try :/opt/myapp/druid-0.8.3/config/_common-classpath/*:
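
For what it’s worth, a sketch of the classpath entry I would expect to work here. As far as I know, Java expands a “dir/*” classpath entry to the .jar files in that directory only, so it is the plain directory entry that makes loose XML files such as core-site.xml and hdfs-site.xml visible as classpath resources:

#!/bin/sh
# Put the conf directory itself (not “dir/*”) on the classpath so the XMLs are visible;
# a “dir/*” entry is expanded by the JVM to .jar files only.
CONF_DIR=/opt/myapp/druid-0.8.3/config/_common-classpath

/usr/bin/java -classpath "config/_common:config/overlord:lib/*:${CONF_DIR}" \
  io.druid.cli.Main server overlord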