Load and Index data from HDFS

Hello Druid Gurus,

I have a few questions before I load and index data from HDFS. (I did try an example and it failed; I'm sure I missed a few steps before running the task, hence the questions below.)

  1. Do I have to add/modify any config or properties files in the Druid installation to provide the path or connectivity from Druid to HDFS?
  2. Are there any other environment settings I need to be aware of, other than in question 1?



Hey Karteek, if you’re not doing it already, try including your Hadoop jars and Hadoop config XMLs on Druid’s classpath. Also, if you’re using HDFS for deep storage (as opposed to just using Hadoop for indexing) then make sure to include the druid-hdfs-storage module as one of your extensions and set druid.storage.type=hdfs on all your Druid nodes.
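Putting Gian's advice into config form, the deep-storage side of `_common/common.runtime.properties` might look roughly like this (a sketch for the 0.8.x line; the extension coordinates and the namenode host/port/path are illustrative, so adjust them to your cluster):

```properties
# Load the HDFS deep storage module
druid.extensions.coordinates=["io.druid.extensions:druid-hdfs-storage"]

# Store segments in HDFS (namenode host/port and path are illustrative)
druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://namenode:9000/druid/segments
```

These properties go on all Druid nodes, and the Hadoop config XMLs (core-site.xml, hdfs-site.xml) still need to be on each node's classpath.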

Thanks Gian, glad to see your answer to this question. I'm confused about how to use HDFS for my deep storage. I used the local filesystem as deep storage and it worked well; then I tried to replace it with HDFS (p.s. I have a Hadoop cluster). I edited the config file to set the storage type to hdfs and provided a path such as "hdfs://:9000/", booted Druid, and ran my start-dfs.sh to boot HDFS. However, it didn't keep my data in HDFS; instead, it created recursive directories named hdfs:, :9000. Well, I have to admit that I didn't include my Hadoop jars and config XMLs on Druid's classpath. Would you please help me and tell me how to do that so I can successfully set up deep storage using HDFS? Thanks.

On Wednesday, July 22, 2015 at 5:11:58 AM UTC+8, Gian Merlino wrote:

Hi Jz,

You’ll need to do 3 things to use HDFS for deep storage.

  1. Include the HDFS extension in your list of extensions.

  2. Set the proper deep storage configs for HDFS.

  3. Include the relevant Hadoop configuration files in the classpath of the nodes you are using.
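The three steps above might be sketched like this (assuming the Hadoop conf XMLs live in /etc/hadoop/conf and using illustrative extension coordinates and an illustrative namenode address; adjust to your layout):

```shell
# 1. + 2. in common.runtime.properties:
#      druid.extensions.coordinates=["io.druid.extensions:druid-hdfs-storage"]
#      druid.storage.type=hdfs
#      druid.storage.storageDirectory=hdfs://namenode:9000/druid/segments
# 3. Append the directory holding the Hadoop XMLs to the classpath:
java -classpath "config/_common:config/realtime:lib/*:/etc/hadoop/conf" \
  io.druid.cli.Main server realtime
```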

If this doesn’t work, can you share your ingestion spec?

Hi, Fangjin,

Thanks a lot. What are the "proper configs" in step 2?

Besides, I did steps 1 and 3, but the Druid realtime node throws "No FileSystem for scheme: hdfs". I guess I must have set the Hadoop configuration on the classpath the wrong way. Below is part of my _common/common.runtime.properties and the command line I use to run the historical node (I copied all the Hadoop XMLs into a directory called hadoopConf under druid/config).

config file:

# Metadata Storage (mysql)

# Deep storage (local filesystem for examples - don't use this in production)
#druid.storage.type=local

druid.storage.type=hdfs
script to boot the historical node:

#! /bin/bash


java -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath config/_common:config/historical:lib/*:${HADOOP_OPTS} io.druid.cli.Main server historical
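Note that ${HADOOP_OPTS} usually holds JVM options rather than a directory, so interpolating it into -classpath may not put the Hadoop XMLs on the classpath at all. A sketch that appends the conf directory itself (assuming the XMLs were copied into config/hadoopConf, as described above):

```shell
#!/bin/bash
# Append the directory holding core-site.xml / hdfs-site.xml to the classpath
java -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
  -classpath "config/_common:config/historical:lib/*:config/hadoopConf" \
  io.druid.cli.Main server historical
```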

and the below is the log of realtime:

2015-08-25T10:58:01,224 ERROR [wikipedia-2015-08-25T10:50:00.000Z-persist-n-merge] io.druid.segment.realtime.plumber.RealtimePlumber - Failed to persist merged index[wikipedia]: {class=io.druid.segment.realtime.plumber.RealtimePlumber, exceptionType=class java.io.IOException, exceptionMessage=No FileSystem for scheme: hdfs, interval=2015-08-25T10:50:00.000Z/2015-08-25T10:55:00.000Z}

java.io.IOException: No FileSystem for scheme: hdfs

at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2304) ~[?:?]

at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2311) ~[?:?]

at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90) ~[?:?]

at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2350) ~[?:?]

at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2332) ~[?:?]

at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369) ~[?:?]

at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) ~[?:?]

at io.druid.storage.hdfs.HdfsDataSegmentPusher.push(HdfsDataSegmentPusher.java:83) ~[?:?]

at io.druid.segment.realtime.plumber.RealtimePlumber$4.doRun(RealtimePlumber.java:456) [druid-server-0.8.0.jar:0.8.0]

at io.druid.common.guava.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:40) [druid-common-0.8.0.jar:0.8.0]

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_51]

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_51]

at java.lang.Thread.run(Thread.java:744) [?:1.7.0_51]

2015-08-25T10:58:01,237 INFO [wikipedia-2015-08-25T10:50:00.000Z-persist-n-merge] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"alerts","timestamp":"2015-08-25T10:58:01.237Z","service":"realtime","host":"localhost:8084","severity":"component-failure","description":"Failed to persist merged index[wikipedia]","data":{"class":"io.druid.segment.realtime.plumber.RealtimePlumber","exceptionType":"java.io.IOException","exceptionMessage":"No FileSystem for scheme: hdfs","exceptionStackTrace":"java.io.IOException: No FileSystem for scheme: hdfs\n\tat org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2304)\n\tat org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2311)\n\tat org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)\n\tat org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2350)\n\tat org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2332)\n\tat org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369)\n\tat org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)\n\tat io.druid.storage.hdfs.HdfsDataSegmentPusher.push(HdfsDataSegmentPusher.java:83)\n\tat io.druid.segment.realtime.plumber.RealtimePlumber$4.doRun(RealtimePlumber.java:456)\n\tat io.druid.common.guava.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:40)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat java.lang.Thread.run(Thread.java:744)\n","interval":"2015-08-25T10:50:00.000Z/2015-08-25T10:55:00.000Z"}}]

2015-08-25T11:00:00,250 INFO [chief-wikipedia[0]] io.druid.server.coordination.BatchDataSegmentAnnouncer - Announcing segment[wikipedia_2015-08-25T11:00:00.000Z_2015-08-25T11:05:00.000Z_2015-08-25T11:00:00.000Z] at path[/druid/segments/localhost:8084/localhost:8084_realtime__default_tier_2015-08-25T10:48:44.407Z_9aa0cac5257a4091b8e0cfa8b4d050f70]

On Tuesday, August 25, 2015 at 9:52:40 AM UTC+8, Fangjin Yang wrote:

Hi, Fangjin,

Thanks for your help, I made it work!

As for the "No FileSystem for scheme: hdfs" exception, it was caused by the missing fs.hdfs.impl setting in Hadoop's core-site.xml. Now that I have added it to core-site.xml, it works!
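For reference, the relevant core-site.xml entry looks roughly like this (the value is Hadoop's standard HDFS implementation class):

```xml
<property>
  <name>fs.hdfs.impl</name>
  <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
</property>
```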

On Tuesday, August 25, 2015 at 9:52:40 AM UTC+8, Fangjin Yang wrote:

Hi Jz, great to hear you got it working!

Hi FJ,

These days I tried deploying druid-0.9.1. I successfully started up the coordinator, broker, overlord, and middleManager, and loaded the batch example data following the "quickstart" document; I could also see "SUCCESS" on the page http://localhost:8090/console.html.

However, when I tried to test streaming data using the Tranquility tool as the document describes, an exception occurred as follows:


I looked for the id.druid.service.jar under tranquility/lib and indeed there was no such file. What's more, I copied the other .jar files, including the one with the AbstractTask class, under ${druid.home}/lib, but it still didn't work.


Please help and give me some advice if you have any idea, thanks a lot.


Hey Jz, have you figured this out yet? It looks like you’re trying to include the kafka-indexing-service extension into Tranquility which is not supported. The Kafka indexing service is independent from Tranquility and the extension should be loaded onto the overlord and middle manager nodes.
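Concretely, the extension belongs in the common.runtime.properties of the overlord and middle manager nodes, not in Tranquility (a sketch; adjust the load list to include whatever other extensions you already use):

```properties
# On the overlord and middle managers only, not in Tranquility
druid.extensions.loadList=["druid-kafka-indexing-service"]
```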

Hi David,

Finally I worked it out by replacing Tranquility with a version-0.8.0 one; it seems to have been a version compatibility problem.

Thanks for your attention!