Ingestion Spec for consuming a file on a remote HDFS cluster

Hi,

I need some help with ingesting a file from a remote HDFS cluster. What needs to be passed in the ingestion spec for this?

Hi Shubham,

You can follow the documentation here. It is very detailed and you will find everything you need there. Make sure the necessary ports are open between your Druid cluster and the remote HDFS cluster (NameNode port 8020 and DataNode port 50010).

If you want to use HDFS as deep storage, set the property "druid.storage.type" to "hdfs" and "druid.storage.storageDirectory" to the HDFS location you want to use as deep storage. Also make sure the druid user has the necessary permissions to write data to that location.
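For reference, a rough sketch of what that could look like in conf/druid/_common/common.runtime.properties (the storageDirectory below is just a placeholder, and loading the druid-hdfs-storage extension is assumed):

    # Assumed: the HDFS deep storage extension has to be on the extension load list
    druid.extensions.loadList=["druid-hdfs-storage"]

    # Use HDFS as deep storage
    druid.storage.type=hdfs
    # Placeholder HDFS location; point this at your own deep storage directory
    druid.storage.storageDirectory=hdfs://namenode-host:8020/druid/segments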

Thanks!

Manu

Hi Manu,

I've followed this and already have an ingestion spec for a local HDFS file.

However, in my particular scenario, I have to check whether I can ingest the same file from a remote cluster.

Do I only have to add "path": "hdfs://server-ip:8020/loc/" to be able to ingest from a remote server?

Hi Shubham,

Sorry, I missed one important thing. You have to copy the "hdfs-site.xml" and "core-site.xml" files from your remote HDFS cluster to the Druid cluster location "conf/druid/_common/". This lets Druid pick up the details of the remote HDFS cluster. Then you can simply specify "paths": "/loc/" in your ingestion spec.
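For example, the relevant part of a Hadoop batch ingestion spec could then look roughly like this (the file name is a placeholder, and a static inputSpec is assumed):

    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "/loc/your-file.orc"
      }
    }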

You can get some additional details here as well.

Hi Manu,

I get what you're saying, but I don't want to use the resources of that cluster. My Druid is installed on a Hadoop cluster, and I want to use this cluster's resources. However, I only want to fetch the contents of a file from a remote server.

The changes that you're suggesting will, I think, also consume resources on the remote cluster to index the data.

Makes sense. Unfortunately, I am not sure how to make that work. As far as I know, you can move the files to the local HDFS cluster where Druid is installed and run the ingestion job from there.

Yeah, I’ve been able to ingest from local HDFS.

Also, I did just add "path": "hdfs://server-ip:8020/loc/" in the ingestion spec. It seems it is able to connect, though it's throwing an error: "java.lang.ClassNotFoundException: org.apache.hadoop.hive.common.io.DiskRange". Any idea about this?

Are both of your Hadoop clusters running the same version? Please check whether tip #2 in https://druid.apache.org/docs/latest/operations/other-hadoop.html helps.
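If it is a version mismatch, tip #2 essentially comes down to adding classloader isolation settings to the jobProperties of your tuningConfig, along these lines (a sketch; take the exact values from the linked page):

    "tuningConfig": {
      "type": "hadoop",
      "jobProperties": {
        "mapreduce.job.classloader": "true"
      }
    }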

Thanks!

Manu

Thank you so much. Yes, you’re right. Hadoop versions are different.
I’ll try to figure out how to proceed now.

P.S. I’ve been using those properties mentioned in that link.

So here’s the update.

Simply mentioning "path": "hdfs://remote-server-ip:8020/loc/" works. It takes the data file from the remote server, while the resources of the current server are being used.
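In other words, the inputSpec ends up looking roughly like this (a sketch; the file name is a placeholder, and note that the Druid docs use "paths" as the key for the static inputSpec):

    "inputSpec": {
      "type": "static",
      "paths": "hdfs://remote-server-ip:8020/loc/your-file.orc"
    }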

As Manu said, I was facing issues because of a Hadoop version mismatch, where some Hive classes were not present.

I followed the doc at https://druid.apache.org/docs/latest/development/extensions-core/orc.html#hadoop-job-properties; in the Hadoop job properties section, they've mentioned the steps to include the Hive jars.
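For anyone hitting the same error, the settings described there are along these lines (a sketch; take the exact values from the linked doc). The system.classes exclusion is what makes the Hive classes load from the job classpath instead of the Hadoop system classpath:

    "jobProperties": {
      "mapreduce.job.classloader": "true",
      "mapreduce.job.classloader.system.classes": "-org.apache.hadoop.hive., org.apache.hadoop."
    }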

I was able to execute the ingestion after that.