Ingesting local files with HDFS configured as deep storage

I’m fairly new to Druid, and I’m wondering whether it’s possible to ingest locally stored files using a Hadoop indexing task while HDFS is configured as deep storage (I know this is possible with an index task and a firehose). When I tried it, Druid searched the HDFS directory for the files to ingest.

My intuition is that Hadoop indexing tasks must use files stored in HDFS. What confuses me is that the quickstart tutorial uses a Hadoop index task with the to-be-ingested files stored locally; however, it uses local disk as deep storage.

Any clarification on this is much appreciated. Thank you!

Hi Edward,

I'm also trying to ingest local files using a remote Hadoop cluster and am facing the same problem (Druid is looking in the HDFS directory).
Did you find anything related to this?

Thanks

Hi Everyone,
Could anyone look into this and point me in the right direction?

You can do it by using an input path starting with “file://”.
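For example, a minimal ioConfig sketch for a Hadoop index task might look like the following (the path shown is only illustrative, borrowing the sample file discussed later in this thread):

    {
      "type": "index_hadoop",
      "spec": {
        "ioConfig": {
          "type": "hadoop",
          "inputSpec": {
            "type": "static",
            "paths": "file:///tmp/shared/wikiticker-2015-09-12-sampled.json.gz"
          }
        }
      }
    }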

Gian

Hey Gian,

Thanks for the answer.

I have started working on Hadoop ingestion using a remote Hadoop cluster (http://druid.io/docs/latest/tutorials/tutorial-batch-hadoop.html) and am now stuck on another problem.

After following the tutorial, the Overlord shows a SUCCESS status for the ingestion when the specified file is in HDFS.

Then I changed the ingestion spec so that ioConfig.inputSpec.paths points to ‘file:///tmp/shared/wikiticker-2015-09-12-sampled.json.gz’, and the task failed after 45-46 seconds.

ingestion-spec - attempt-1.json, log - attempt-1.log

The first line in the stack trace: Error: java.io.FileNotFoundException: File file:/tmp/shared/wikiticker-2015-09-12-sampled.json.gz does not exist

When I pointed the ioConfig to a file that doesn't exist, on purpose, the task failed in 11 seconds.

ingestion-spec - attempt-2.json, log - attempt-2.log

So I put the same data file in the Docker VM, in the same location, and resubmitting the task gave a SUCCESS status after 81 seconds.

ingestion-spec - attempt-1.json, log - attempt-3.log

Please look into this.

attempt1.log (174 KB)

attempt-1.json (2.39 KB)

attempt-2.json (2.36 KB)

attempt-2.log (162 KB)

attempt-3.log (210 KB)

Can anyone look into this? I've been stuck on this problem for a while.

Thanks

From your message with the task log attachments, the behavior is as expected. If you use “file://” as the input, the file must also be available on the local filesystem of the Hadoop cluster.

Ingesting a file from the Druid cluster’s local filesystem without copying it to the Hadoop cluster is not supported.
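If the data currently only exists on the Druid machines, one option is to copy it into HDFS and point the spec at an “hdfs://” path instead, roughly along these lines (the HDFS directory here is only an example):

    {
      "ioConfig": {
        "type": "hadoop",
        "inputSpec": {
          "type": "static",
          "paths": "hdfs:///quickstart/wikiticker-2015-09-12-sampled.json.gz"
        }
      }
    }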

Thanks,

Jon