Docs/examples of hdfs deep storage

Hi - are there any examples or documentation that describe in detail all of the exact things you need to do to make hdfs deep storage work in Druid? I’ve set the following configs in historical and overlord nodes, but this doesn’t seem to be enough:

druid.extensions.coordinates=["io.druid.extensions:druid-hdfs-storage"]

druid.storage.type=hdfs

druid.storage.storageDirectory=hdfs://my.hdfs.com:1234/druid

The indexing task logs show the “No FileSystem for scheme: hdfs” error message.

I’ve found bits and pieces of hdfs information on the mailing list, but wondering if there’s a full, known working example anywhere?

Thanks,

Zach

Hi Zach, do you see the same messages if you set the deep storage in common.runtime.properties? The goal of the common properties is that you set your deep storage and extensions there once and don't have to think about which nodes need them.

We should have more docs on hdfs setup; hopefully folks with more experience running hdfs in production can add their experiences.
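
For reference, a minimal sketch of what the deep-storage portion of common.runtime.properties could look like, reusing the example values from the original post (the host, port, and path are placeholders, not a verified configuration):

druid.extensions.coordinates=["io.druid.extensions:druid-hdfs-storage"]
druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://my.hdfs.com:1234/druid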

Hi,
you also have to set
druid.indexer.task.hadoopWorkingPath=hdfs://path

Hi Fangjin - those 3 config values are actually placed into common.runtime.properties on the historical and overlord nodes.

I’ve been trying to achieve Solution #1 from [1] by putting the hadoop jars at the end of the classpath; I haven't succeeded yet but am still trying. I'll report my findings.

Thanks,

Zach

[1] https://github.com/druid-io/druid/pull/1022

Hi Slim - currently I’m only sending real-time data to Druid via Tranquility, not doing any batch indexing tasks. Do I still need to set druid.indexer.task.hadoopWorkingPath, or is that just for batch tasks?

Thanks,

Zach

I do not have experience with Tranquility, but I have used the realtime node with hdfs, and the only properties you have to set are druid.storage.type and druid.storage.storageDirectory.
In my case, though, I do not specify the extension coordinates; I load the jars via the classpath instead:

druid.extensions.coordinates=[]
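
Put together, a rough sketch of that setup (the directory is just an example; with empty coordinates the extension and hadoop jars are expected on the classpath, as described above):

druid.extensions.coordinates=[]
druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://my.hdfs.com:1234/druid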

Hi Slim - thanks for the info. How do you obtain all of the hadoop jars to put on the classpath?

Thanks,

Zach

In my use case the hadoop jars are installed by the SE guys in a specific location, so I just supply the path to those jars in a CLASSPATH environment variable via daemontools.
You don't have to do it the same way I do; it's equivalent to passing -classpath when you run the java command.
How do you start the Druid processes?

PS: you can see which jars are actually loaded by running this command:
sudo lsof -p <YOUR_PROCESS_PID> | grep hadoop-hdfs
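
For illustration only, a daemontools-style run script along these lines (the jar locations are hypothetical; adjust for wherever your Druid and hadoop client jars are installed):

#!/bin/sh
# hypothetical run script for a realtime node; all paths are examples only
CLASSPATH="/opt/druid/lib/*:/opt/druid/config/realtime:/usr/lib/hadoop/client/*"
exec java -cp "$CLASSPATH" io.druid.cli.Main server realtime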

Hi Zach, for your particular problem, are you running the indexing service in local or remote mode?

During initial testing we’re running overlord with:

druid.indexer.runner.type=local

I actually just got things working with hdfs:

https://github.com/Banno/druid-docker/commit/49ab8ee7f7e4af6250e8ddd9a9c0d88d1c93847d

That basically just puts the hadoop jars at the end of the classpath, as others have suggested. The task logs show segments being transferred to hdfs, I can see the segment directories in hdfs, the coordinator UI shows the segments, query results are correct, etc.
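
For anyone verifying the same thing, the segment directories can also be checked directly with the hadoop CLI (the path is just the storageDirectory from earlier in the thread):

hadoop fs -ls hdfs://my.hdfs.com:1234/druid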

So it works, but it’s a really awful hack. Even worse, it’s undocumented. It seems like this should be all you need to do to enable hdfs:

druid.extensions.coordinates=["io.druid.extensions:druid-hdfs-storage"]