hadoop deep storage

Hi,
I’m following the instructions here:

http://druid.io/docs/latest/dependencies/deep-storage.html

to set up hadoop deep storage. I’ve got a basic hadoop setup working, but I’m getting exceptions in the realtime server when it tries to merge segments to deep storage. My common.runtime.properties has the following values:

druid.extensions.coordinates=["io.druid.extensions:druid-kafka-eight","io.druid.extensions:mysql-metadata-storage","io.druid.extensions:druid-hdfs-storage"]

druid.extensions.localRepository=/opt/druid/repository

druid.extensions.remoteRepositories=

# Deep storage

druid.storage.type=hdfs

druid.storage.storageDirectory=hdfs://hadoop:9000/druid

where hadoop:9000 is the name node. Note that I’m running both druid and hadoop in docker containers, which is to say that the druid services are not local to the hadoop installation.

The exception I’m getting is:

2015-09-09T14:31:00,512 ERROR [page_views-2015-09-09T14:20:00.000Z-persist-n-merge] io.druid.segment.realtime.plumber.RealtimePlumber - Failed to persist merged index[page_views]: {class=io.druid.segment.realtime.plumber.RealtimePlumber, exceptionType=class java.io.IOException, exceptionMessage=No FileSystem for scheme: hdfs, interval=2015-09-09T14:20:00.000Z/2015-09-09T14:21:00.000Z}

java.io.IOException: No FileSystem for scheme: hdfs

at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2304) ~[?:?]

at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2311) ~[?:?]

at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90) ~[?:?]

at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2350) ~[?:?]

at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2332) ~[?:?]

at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369) ~[?:?]

at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) ~[?:?]

at io.druid.storage.hdfs.HdfsDataSegmentPusher.push(HdfsDataSegmentPusher.java:83) ~[?:?]

at io.druid.segment.realtime.plumber.RealtimePlumber$4.doRun(RealtimePlumber.java:456) [druid-server-0.8.0.jar:0.8.0]

at io.druid.common.guava.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:40) [druid-common-0.8.0.jar:0.8.0]

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_76]

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_76]

at java.lang.Thread.run(Thread.java:745) [?:1.7.0_76]

I dug into the hadoop sources, and it seems like it wants a value for the property fs.hdfs.impl, which is documented in core-default.xml (https://hadoop.apache.org/docs/r1.0.4/core-default.html). The *-default.xml files are packaged inside the hadoop jars, which should be on the class path for me (via the local repository configuration for extensions):

[root@1b16ebc845d1 druid]# jar tf /opt/druid/repository/org/apache/hadoop/hadoop-hdfs/2.3.0/hadoop-hdfs-2.3.0.jar|grep default

hdfs-default.xml

[root@1b16ebc845d1 druid]#
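For what it’s worth, this is roughly how one could check whether fs.hdfs.impl is actually defined in the core-default.xml that ships inside the hadoop-common jar (the path and version here are just what my local repository happens to contain):

unzip -p /opt/druid/repository/org/apache/hadoop/hadoop-common/2.3.0/hadoop-common-2.3.0.jar core-default.xml | grep -A1 fs.hdfs.impl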

Can someone point out what I’m missing?

I did manage to find this:

http://druid.io/docs/latest/ingestion/faq.html

Make sure to include the druid-hdfs-storage and all the hadoop configuration, dependencies (that can be obtained by running command hadoop classpath on a machine where hadoop has been setup) in the classpath. And, provide necessary HDFS settings as described in Deep Storage .

It seems to imply that I should be running the realtime nodes on machines that have a full hadoop installation (I was hoping to get away with just the hadoop jars pulled in via druid-hdfs-storage). And the language is pretty vague about what “all the hadoop configuration” actually means.
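If I read the FAQ literally, I think it means splicing the output of hadoop classpath into the startup command, something like the sketch below, which assumes a hadoop client install on the druid node (the very thing I was hoping to avoid):

CP=config/_common:config/realtime:$(hadoop classpath)

java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Ddruid.realtime.specFile=/opt/druid/rt.spec -classpath $CP io.druid.cli.Main server realtime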

Like many other OSS projects, Druid relies on the hadoop FileSystem abstractions for HDFS access. As such, it requires some hadoop jars, as well as the proper hdfs client configuration on the classpath, in order to access HDFS. I’ve never experimented with the minimum set of hadoop jars required, but you certainly don’t need to be running hadoop or hdfs on the node.
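To be concrete, by hdfs client configuration I mean something like the usual core-site.xml (plus hdfs-site.xml if your cluster needs it) sitting in a directory that is on the classpath. A minimal sketch, with the namenode address adjusted to your setup, would be roughly:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop:9000</value>
  </property>
</configuration>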

Finally got this working. Just having the hadoop jars in the local repository doesn’t cut it; you also need the hadoop jars (and all of their dependencies) on the class path for the realtime node. So, assuming you have a local repository set up, something like this works:

CP=config/_common:config/realtime:$(find /opt/druid/repository -name '*.jar' | tr '\n' ':')

java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Ddruid.realtime.specFile=/opt/druid/rt.spec -Dfs.hdfs.impl=org.apache.hadoop.hdfs.DistributedFileSystem -classpath $CP io.druid.cli.Main server realtime

One last question though. The documentation is pretty clear that the whole point of keeping a separate repository for extensions is that it lets extensions be loaded by a separate class loader, so that extension dependencies don’t collide with druid core (as explained here: http://druid.io/docs/latest/operations/including-extensions.html). All my solution does is bypass that separate class loader and stuff the class path directly with the extension dependencies, which is exactly what invites dependency conflicts. Given that, I feel like there should be a solution that doesn’t involve adding to the class path directly.

So, to get this working via maven coordinates (i.e., with the extensions loaded in a separate class loader from the realtime server), I had to create a core-site.xml with the fs.hdfs.impl property set, and then add it to the hadoop-common jar that’s in the local repository.

(I was under the impression that fs.hdfs.impl is set in core-default.xml, which is in hadoop-common.jar, and it was, at least at one point: https://hadoop.apache.org/docs/r0.23.11/hadoop-project-dist/hadoop-common/core-default.xml). But it’s not set there anymore. Just adding core-site.xml to the realtime class path won’t cut it, because the extension jars are loaded by a separate class loader.
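For reference, the core-site.xml I ended up with is roughly just the one property (the value matches the -Dfs.hdfs.impl setting above), and I pushed it into the hadoop-common jar in the local repository with jar uf; paths and versions here are from my setup:

<configuration>
  <property>
    <name>fs.hdfs.impl</name>
    <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  </property>
</configuration>

jar uf /opt/druid/repository/org/apache/hadoop/hadoop-common/2.3.0/hadoop-common-2.3.0.jar core-site.xml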

I’m pretty sure it would still work if the druid-hdfs-storage jar were packaged with a core-site.xml that had the property set. Not sure if there are any downsides to doing that.
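If someone wanted to try that, I imagine it would just be the same core-site.xml dropped into the extension jar, something like this (assuming the jar sits at the usual maven layout path in the local repository):

jar uf /opt/druid/repository/io/druid/extensions/druid-hdfs-storage/0.8.0/druid-hdfs-storage-0.8.0.jar core-site.xml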