HadoopDruidIndexer batch

I’ve been experimenting with the HadoopDruidIndexer.

The setup I have is a remote multi-node Hadoop cluster and a single node Druid cluster.

I successfully ran a batch upload on a small amount of data (7 MB).

I did this with a recompiled fat jar containing all the dependencies.

However, when I ran it on a 700 MB file for an hourly segment, it failed to finish.

I’m trying to understand what resources are necessary for this batch job to finish.

Do I need a bigger Overlord? Several overlords? Several middle managers?

Also, I got the data out of Hadoop using the hadoop fs -text command and ran the batch indexing service on the resulting JSON text file.

Again, it was 700 MB compressed, which decompressed to about 3 GB. It loaded in about an hour.

Is it because the regular batch indexing service is not a distributed process, while the HadoopDruidIndexer is?

Please enlighten me.

–Johnny Hom

If you already have data in Hadoop, it’s best to use Druid’s hadoop indexer. That can be done with the standalone HadoopDruidIndexer or with the “index_hadoop” task. The plain batch “index” task is woefully inefficient and really only works on quite small amounts of data. The hadoop indexing methods can scale out to use as much capacity as you have available on your Hadoop cluster.
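For reference, an "index_hadoop" task is submitted to the Overlord as a JSON spec. The sketch below shows the general shape of such a spec; the datasource name, dimensions, interval, and HDFS path are placeholders you would replace with your own values, and the exact fields can vary by Druid version.

```json
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "my_datasource",
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "timestamp", "format": "auto" },
          "dimensionsSpec": { "dimensions": ["dim1", "dim2"] }
        }
      },
      "metricsSpec": [ { "type": "count", "name": "count" } ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "NONE",
        "intervals": ["2014-01-01T00:00:00Z/2014-01-01T01:00:00Z"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": { "type": "static", "paths": "hdfs://namenode:8020/path/to/data" }
    },
    "tuningConfig": { "type": "hadoop" }
  }
}
```

The actual indexing work runs as MapReduce jobs on your Hadoop cluster, which is why it scales with Hadoop capacity rather than with the size of your Druid indexing nodes.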