Can someone suggest a way to ingest data into Druid using a Dataproc job? How do we create the jar file, or pass multiple jars? I want the Dataproc equivalent of this java command:
```
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:<hadoop_config_dir> org.apache.druid.cli.Main index hadoop <spec_file>
```
I want to do the same thing with the gcloud dataproc jobs submit hadoop command. If someone has done this successfully, please guide me.
Welcome to the Druid Forum. To ingest data using your Dataproc cluster, you will have to submit an index_hadoop ingestion spec to your Druid cluster.
This may give you more information:
Yes, I know that, but how do I submit the command to Dataproc if I want to use the command line itself? Can someone shed some light on this, i.e. the equivalent Dataproc command? Something like what was done in "Pushing Segments to Google Storage from Hadoop Batch Indexer - #2 by Nishant_Bangarwa2": "gcloud dataproc jobs submit hadoop --quiet --cluster $cluster --project=ad-veritas --jar gs://forpeter/druid_build-assembly-0.1-SNAPSHOT.jar --region=global -- io.druid.cli.Main index hadoop gs://forpeter/wikiticker-index.json". So, how do I build this jar?
I am not sure what the ask is here. If it is to ingest data using a Hadoop cluster, all you need to do is submit the ingestion spec and Druid will spin up the job. You don’t have to submit the job to Hadoop. The parameters you mentioned above will be part of the ingestion spec.
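For reference, "submit the ingestion spec" here means an HTTP POST of the task JSON to the Overlord's /druid/indexer/v1/task endpoint. A minimal sketch, assuming a hypothetical Overlord address and a hypothetical local file name (neither comes from this thread); the file must be a full task definition with "type": "index_hadoop". By default the script only prints the command; set RUN=1 to actually send it.

```shell
#!/usr/bin/env bash
# Sketch only: OVERLORD_URL and SPEC_FILE are hypothetical placeholders,
# adjust them to your own cluster and spec.
set -euo pipefail

OVERLORD_URL="${OVERLORD_URL:-http://localhost:8090}"   # assumed Overlord address
SPEC_FILE="${SPEC_FILE:-hadoop-index-task.json}"        # assumed local task JSON

# POST the task JSON to the Overlord task endpoint.
cmd=(curl -X POST
  -H "Content-Type: application/json"
  -d "@${SPEC_FILE}"
  "${OVERLORD_URL}/druid/indexer/v1/task")

# Dry run by default: show the command instead of executing it.
printf '%q ' "${cmd[@]}"
printf '\n'

if [[ "${RUN:-0}" == "1" ]]; then
  "${cmd[@]}"
fi
```

Druid then launches the Hadoop job itself, using whatever Hadoop configuration is on its classpath, so nothing is submitted to Dataproc by hand in this flow.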
But how do I configure the job to be submitted to Dataproc?
See if this thread helps with what you are trying to achieve:
They have not specified the exact method to call the ingestion job in Dataproc.
Here is a link with the arguments that can be passed to gcloud dataproc jobs submit hadoop:
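Going by those arguments, a rough and untested sketch of a command-line equivalent of the java invocation might look like the following. It assumes the jars from Druid's lib/ directory were uploaded to a GCS bucket and are passed with --jars (a comma-separated list) instead of a single assembly jar, with the main class given via --class. The cluster name, region, bucket paths, and jar file names are all hypothetical placeholders, and the JVM options from the original command (-Xmx256m, -Duser.timezone, -Dfile.encoding) are not carried over. By default the script only prints the command; set RUN=1 to submit it.

```shell
#!/usr/bin/env bash
# Sketch only: every name below is a hypothetical placeholder, and this
# has not been verified against a real Dataproc cluster.
set -euo pipefail

CLUSTER="${CLUSTER:-my-dataproc-cluster}"                  # assumed cluster name
REGION="${REGION:-us-central1}"                            # assumed region
LIB="${LIB:-gs://my-bucket/druid/lib}"                     # assumed upload location of lib/*.jar
SPEC="${SPEC:-gs://my-bucket/druid/wikiticker-index.json}" # assumed spec location

# --jars needs an explicit comma-separated list; these two names are placeholders.
JARS="${JARS:-${LIB}/first.jar,${LIB}/second.jar}"

# Everything after the bare "--" is passed to the main class as arguments.
cmd=(gcloud dataproc jobs submit hadoop
  --cluster="${CLUSTER}"
  --region="${REGION}"
  --class=org.apache.druid.cli.Main
  --jars="${JARS}"
  --
  index hadoop "${SPEC}")

# Dry run by default: show the command instead of executing it.
printf '%q ' "${cmd[@]}"
printf '\n'

if [[ "${RUN:-0}" == "1" ]]; then
  "${cmd[@]}"
fi
```

The quoted command earlier in the thread takes the other route: one assembly (fat) jar passed with --jar, which avoids listing every jar but requires building that jar first.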