Batch Ingestion

Hi all,

I'm looking for some details on batch ingestion and best practices; any suggestions or tips are welcome:

  • I have tens of terabytes of historical data in S3 that I want to bring into Druid.

  • I also have some aggregated/pre-computed data (terabytes) in a key-value store.

So far I have been trying the indexing service to ingest data. What is the most efficient way to bring all of this into Druid as an initial load?

We already have Spark pipelines in a few places, so my plan is to use Spark rather than Hadoop. Is there any reference for this, or is anyone using a similar approach?

Thank you

The current standard way to load terabytes of data from S3 is batch ingestion with Hadoop using the HadoopIndexTask. It is what most folks use in production to get large sets of data into Druid.
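As a rough illustration, a HadoopIndexTask spec submitted to the indexing service looks something like the sketch below. The datasource name, S3 paths, dimensions, and intervals are all placeholders you would replace with your own:

```json
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "historical_events",
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "timestamp", "format": "auto" },
          "dimensionsSpec": { "dimensions": ["country", "device"] }
        }
      },
      "metricsSpec": [ { "type": "count", "name": "count" } ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2015-01-01/2016-01-01"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "s3n://my-bucket/events/2015/*"
      }
    },
    "tuningConfig": { "type": "hadoop" }
  }
}
```

You POST this JSON to the Overlord's task endpoint (`/druid/indexer/v1/task`), and for a multi-terabyte initial load you would typically submit one task per time range rather than one giant task.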
There have also been multiple community requests for a Spark batch indexer; related work & discussion is here -

and the source code and initial docs on how to use it can be found here -

Hi Nishant,

Thank you for the reply. Looking at the details of the Spark discussion thread, it seems still too early to use Spark for a production pipeline.

~ Biswajit

Hi Nishant,

I was going through the Hadoop indexer docs, and it seems the Hadoop indexer launches internally within Druid, if I'm not wrong. How do I use an external Hadoop cluster? I'm sorry if I'm missing anything here.

Thank you.

Place your Hadoop configuration files (core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml) from the external cluster on the classpath of your Druid nodes that run indexing tasks.

Described in detail here:
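In practice that usually means copying the Hadoop client XMLs into a directory that Druid's classpath already includes. A minimal sketch is below; the real paths (e.g. `/etc/hadoop/conf` and Druid's `conf/druid/_common`) depend on your install layout, so this version uses temp directories as stand-ins to make the commands self-contained:

```shell
# Stand-ins for real paths; in a cluster these would be e.g.
# /etc/hadoop/conf and /opt/druid/conf/druid/_common.
HADOOP_CONF=$(mktemp -d)
DRUID_COMMON=$(mktemp -d)

# In a real deployment these XMLs come from your Hadoop distribution;
# here we create placeholders so the copy step can run anywhere.
for f in core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml; do
  echo "<configuration/>" > "$HADOOP_CONF/$f"
done

# Stage the four client XMLs onto the Druid classpath directory.
cp "$HADOOP_CONF"/core-site.xml "$HADOOP_CONF"/hdfs-site.xml \
   "$HADOOP_CONF"/yarn-site.xml "$HADOOP_CONF"/mapred-site.xml \
   "$DRUID_COMMON"/
ls "$DRUID_COMMON"
```

After restarting the Overlord and MiddleManagers with those files on the classpath, Hadoop index tasks will submit their MapReduce jobs to the external cluster named in the XMLs instead of running Hadoop locally.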