Hadoop Indexing Service for batch ingestion of data residing on a remote node

Hello Friends,

We are working on a POC of Druid with Hadoop and Spark. We are running into an issue while batch ingesting our data via the Overlord's indexing service MR job.

Here is my use case in abstract:

  1. Batch-ingestible JSON data is produced by our Spark cluster on a separate node.

  2. Druid's Overlord is running on different nodes (a couple of instances behind a load balancer).

We want to POST this data to Druid's Overlord via the API call (i.e. curl -X POST).

The issue is that the Overlord expects the ingested data file to be present on the same node where it is running, since the path defined in the meta JSON file is of type "static". Is there a way I can specify a remote path instead of a local one? I don't want to copy this batch data file over to the Overlord node.
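For reference, our submission today looks roughly like the following (the host name, file name, and paths are illustrative placeholders, not our actual setup). The task spec is POSTed to the Overlord's task endpoint, and the "static" inputSpec points at a path that only exists on the local filesystem of the ingesting node:

    # Submit the Hadoop index task spec to the Overlord (8090 is the default Overlord port)
    curl -X POST -H 'Content-Type: application/json' \
         -d @hadoop_index_task.json \
         http://overlord-host:8090/druid/indexer/v1/task

    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "/tmp/spark-output/events.json"
      }
    }

The "paths" value above is a plain local path, which is why the file currently has to sit on the Overlord node.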

Your input/suggestions would be very helpful.

Thanks and Regards,

Ankur Kapoor

Hi Ankur,
You can store the Spark-generated files in distributed storage, e.g. HDFS or S3, and modify the paths in the inputSpec of the Hadoop index task to point to that location.
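As a rough sketch (the namenode host and directory below are just placeholders), the "static" inputSpec accepts fully qualified HDFS URIs, and multiple input files can be listed comma-separated in "paths":

    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "hdfs://namenode:8020/user/spark/druid-input/part-00000.json,hdfs://namenode:8020/user/spark/druid-input/part-00001.json"
      }
    }

For S3 you would use s3n:// or s3a:// URIs instead, depending on your Hadoop version and configuration. The MR job launched by the index task then reads its input directly from the distributed store, so nothing needs to be copied to the Overlord node.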