Read from Hadoop without using Docker

Hello Team,

We want to use Druid for our projects, and I'm going through the Quickstart guide. Everything went well up through reading from Kafka, but I'm stuck at reading from Hadoop: that part of the guide shows how to do it after installing a Hadoop Docker image. Since we already have a Hadoop cluster, we don't want to install the Hadoop Docker image.

Is there any documentation for reading files from Hadoop without using Docker?

Thank you,
Abhijeet Kumar

When I skip the Docker part, I get this error:

2018-12-12T10:41:32,883 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[AbstractTask{id='index_hadoop_wikipedia_2018-12-12T10:41:28.218Z', groupId='index_hadoop_wikipedia_2018-12-12T10:41:28.218Z', taskResource=TaskResource{availabilityGroup='index_hadoop_wikipedia_2018-12-12T10:41:28.218Z', requiredCapacity=1}, dataSource='wikipedia', context={}}]

io.druid.java.util.common.ISE: Hadoop dependency [/opt/druid/hadoop-dependencies/hadoop-client/2.8.3] didn't exist!?

at io.druid.initialization.Initialization.getHadoopDependencyFilesToLoad(Initialization.java:279) ~[druid-server-0.12.3.jar:0.12.3]

at io.druid.indexing.common.task.HadoopTask.buildClassLoader(HadoopTask.java:160) ~[druid-indexing-service-0.12.3.jar:0.12.3]

at io.druid.indexing.common.task.HadoopTask.buildClassLoader(HadoopTask.java:134) ~[druid-indexing-service-0.12.3.jar:0.12.3]

at io.druid.indexing.common.task.HadoopIndexTask.run(HadoopIndexTask.java:175) ~[druid-indexing-service-0.12.3.jar:0.12.3]

at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:444) [druid-indexing-service-0.12.3.jar:0.12.3]

at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:416) [druid-indexing-service-0.12.3.jar:0.12.3]

at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_191]

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]

at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]

Steps taken:

  1. I started my Hadoop cluster.

  2. mkdir -p /tmp/shared/hadoop-xml

  3. cp quickstart/wikiticker-2015-09-12-sampled.json.gz /tmp/shared/wikiticker-2015-09-12-sampled.json.gz

  4. Created the HDFS directories, set permissions, and uploaded the sample data:

     hdfs dfs -mkdir /druid
     hdfs dfs -mkdir /druid/segments
     hdfs dfs -mkdir /quickstart
     hdfs dfs -chmod 777 /druid
     hdfs dfs -chmod 777 /druid/segments
     hdfs dfs -chmod 777 /quickstart
     hdfs dfs -chmod -R 777 /tmp
     hdfs dfs -chmod -R 777 /user
     hdfs dfs -put /shared/wikiticker-2015-09-12-sampled.json.gz /quickstart/wikiticker-2015-09-12-sampled.json.gz

  5. cp /usr/local/hadoop/etc/hadoop/*.xml /shared/hadoop-xml

  6. cp /tmp/shared/hadoop-xml/*.xml {PATH_TO_DRUID}/examples/conf/druid/_common/hadoop-xml/

  7. Modified the Druid configuration to use HDFS instead of the local file system.

  8. curl -X 'POST' -H 'Content-Type:application/json' -d @examples/wikipedia-index-hadoop.json http://localhost:8090/druid/indexer/v1/task

  9. The submission was successful.

  10. In the task log I get the error shown above.
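For reference, the "modified the Druid configuration to use HDFS" step roughly corresponds to settings like the following in conf/druid/_common/common.runtime.properties. This is a sketch, not my exact config; the storage directories are assumptions based on the hdfs dfs -mkdir commands above:

```properties
# Load the HDFS deep-storage extension (assumed to be installed)
druid.extensions.loadList=["druid-hdfs-storage"]

# Deep storage on HDFS (paths assumed from the directories created above)
druid.storage.type=hdfs
druid.storage.storageDirectory=/druid/segments

# Indexing-task logs on HDFS (directory name is an assumption)
druid.indexer.logs.type=hdfs
druid.indexer.logs.directory=/druid/indexing-logs
```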

io.druid.java.util.common.ISE: Hadoop dependency [/opt/druid/hadoop-dependencies/hadoop-client/2.8.3] didn't exist!?

I would check that this path exists and has the correct permissions on your MiddleManager node(s).
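A sketch of how one might check for the missing directory and fetch it if absent. The install location is an assumption (adjust DRUID_HOME to your setup); pull-deps is the dependency-fetching tool bundled with Druid 0.12.x:

```shell
# Assumed Druid install location; adjust to your environment
DRUID_HOME=/opt/druid
HADOOP_VERSION=2.8.3
DEP_DIR="$DRUID_HOME/hadoop-dependencies/hadoop-client/$HADOOP_VERSION"

echo "checking $DEP_DIR"
if [ -d "$DEP_DIR" ]; then
  echo "dependency directory present"
else
  # Fetch the hadoop-client jars into hadoop-dependencies/ using Druid's pull-deps tool
  echo "missing -- from $DRUID_HOME, run:"
  echo "java -classpath \"lib/*\" io.druid.cli.Main tools pull-deps -h org.apache.hadoop:hadoop-client:$HADOOP_VERSION"
fi
```

Alternatively, if the cluster's Hadoop version differs from 2.8.3, the task spec can pin the client version via "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:&lt;your-version&gt;"], as long as the matching directory exists under hadoop-dependencies/.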