Why does Druid recommend using the indexing service instead of the Hadoop indexer?

This tutorial describes two ways of loading batch data: the indexing service and the HadoopDruidIndexer (which I believe is CliInternalHadoopIndexer in Druid 0.8.2).

It also describes the advantages of each method.

My situation is that I only need to ingest batch data, and I am currently running CliInternalHadoopIndexer on our Hadoop cluster.
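For context, our invocation looks roughly like the sketch below (memory settings, classpath layout, and the spec file name are placeholders; the `io.druid.cli.Main index hadoop` entry point is from the Druid 0.8.x batch ingestion docs):

```shell
# Sketch: launching the standalone Hadoop indexer (Druid 0.8.x).
# Paths, JVM options, and my_batch_spec.json are placeholders for our setup.
java -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
  -classpath "config/_common:lib/*:${HADOOP_CONF_DIR}" \
  io.druid.cli.Main index hadoop my_batch_spec.json
```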

Configuring the indexing service could be a challenge because we sometimes need to ingest hundreds of segments all at once.

However, in a later tutorial, we found “The HadoopDruidIndexer still remains a valid option for batch ingestion, however, we recommend using the indexing service as the preferred method of getting batch data into Druid.”

My questions are:

  1. Why does Druid recommend using the indexing service (which we have never used previously)?

  2. Do you plan to drop support for indexing on a Hadoop cluster in later releases?

Thanks.

The indexing service provides a lot of extra guarantees when running multiple tasks, especially when running real-time and batch tasks together.

If you’re just using batch tasks and have your own external synchronization (e.g., something to make sure only ONE batch task runs for any given dataSource over a particular time interval), then the standalone CliHadoopIndexer should be fine.

There have not been any discussions about getting rid of the CLI indexer, and I think a few of the devs would raise a stink about getting rid of it because of how easy it is to slap data into a Druid cluster. That is the method I use during most of my development. There has been talk of unifying the ingestion methodologies, but such an approach would have to provide either compatibility with the CLI Hadoop indexer or very similar functionality, and such a feature is still in the “wouldn’t that be nice to have” phase.

Running the indexing service also reduces the number of parameters that have to be passed: whatever submits an index task does not need to know about the Hadoop configuration or the internal Druid metadata storage configuration, unlike when running the standalone Hadoop indexer.
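To illustrate that difference: submitting to the indexing service is just an HTTP POST of the task spec to the overlord's task endpoint. The hostname and the contents of task.json below are placeholders (8090 is the default overlord port); note that no Hadoop or metadata-store settings appear on the client side.

```shell
# Hypothetical submission of a batch index task to the indexing service.
# overlord.example.com and task.json are placeholders for your deployment.
curl -X POST -H 'Content-Type: application/json' \
  -d @task.json \
  http://overlord.example.com:8090/druid/indexer/v1/task
```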

Hi, I have a question about this statement regarding usage of the Indexing Service (http://druid.io/docs/latest/ingestion/batch-ingestion.html):
“Batch Ingestion Using the Indexing Service: Batch ingestion for the indexing service is done by submitting an Index Task (for datasets < 1G) or a Hadoop Index Task.”

Do I have to use HadoopDruidIndexer for datasets > 1G instead of posting tasks to the overlord node (indexing service)? What is the reason for this 1G limit, and what happens if the datasets are larger or much larger?

Regards,

Martin

Hey Martin,

That 1G guideline is for the regular “index” task as opposed to the “index_hadoop” task. It’s not a hard limit, just a guideline. The reason is that the regular “index” task implementation is not very efficient when it needs to create multiple segments, which will likely happen for data over 1G in size. Indexing will work, but it will likely be slower than you expect. The Hadoop-based task scales efficiently to a large number of segments.
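For concreteness, both tasks are submitted to the overlord the same way and differ in the top-level "type" field of the spec. The skeleton below is illustrative only (the dataSchema/ioConfig/tuningConfig contents are elided); a regular index task would use "type": "index" instead, with matching "index" types in its ioConfig and tuningConfig.

```json
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": { },
    "ioConfig": { "type": "hadoop" },
    "tuningConfig": { "type": "hadoop" }
  }
}
```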

Shameless plug for the Spark index task in development, which also handles large datasets well: https://github.com/metamx/druid-spark-batch