Recommended way doing batch ingestion

Hi,

Just try to clarify what is the recommended way to do batch ingestion for Druid. Hadoop indexer vs. indexing service. We need to do batch load whenever we need to backfill a failed realtime indexing job (by Tranquility) or some late arriving messages. Our data size is about 60GB per hour.

I’m wondering that what is the recommended way of doing batch ingestion? I recall in earlier document, indexing service is only recommended for small data size (1GB), larger batch the Hadoop indexer is recommended. Now I read the document http://druid.io/docs/latest/tutorials/tutorial-loading-batch-data.html, it says “the HadoopDruidIndexer still remains a valid option for batch ingestion, however, we recommend using the indexing service as the preferred method of getting batch data into Druid”.

I’d like to check that it is true that indexing service is preferred over hadoop regardless of data size, which makes sense to me (if you can do realtime indexing with indexing service, batch indexing should work with the same set of nodes)?

Thanks

Shuai

Hi Shuai,

it is highly recommended to use the overlord (indexer node) to do batch ingestion. This will make thing way easier to manage.

In fact overload can be used to do realtime and batch, it can be used to track all your jobs completion and make sure that indexing is done or in progress, you will have to deal with one node instead of spawning your self many druid hadoop processes. indexer node collects the logs of jobs…

so the bottom line use overlord node to index your data.

Hi Shuai,

My response might be a bit to late, but I’m posting it for future reference anyway.
The documentation regarding the possible options for batch ingestion is really confusing and I also had a hard time to figure everything out properly.

Based on my understanding there are the following ways to ingest data via batch into Druid (@Druid devs: please correct me if I’m wrong):

  1. Stand-alone Hadoop Druid Indexer (which is just a M/R job you submit manually)
  2. Indexing Service

When using the indexing service, you again have multiple options:

  1. Run all Indexing Service components together (local mode)
  2. Run Overlord and MiddleManagers separately and let Peons do all the work (remote mode)
  3. Run Overlord and MiddleManagers separately and let Peons delegate the work to Hadoop (also remote mode, but with Hadoop configuration files in the Classpath of MiddleManagers)

I hope this makes it more clear what the available options are.

Best regards,
Daniel

Daniel, very helpful response. 2.3 sounds reasonable.

Ideally we’d like to do batch processing on the same hardware and same software (no Hadoop) so we do not have additional dependencies. I guess the concern here for the standalone indexing service to handle large size data is the poor performance without Hadoop. Would appreciate if other folks who happened read this thread can share your own setup and what you think works best for your usecase