I’ve been experimenting with the HadoopDruidIndexer.
The setup I have is a remote multi-node Hadoop cluster and a single-node Druid cluster.
I successfully ran a batch upload on a small amount of data (about 7 MB).
I did this with a recompiled fat jar containing all the dependencies.
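For reference, the invocation I'm using looks roughly like this (the fat jar name, Hadoop config directory, and spec file are placeholders, not my exact values, and the main class varies by Druid version):

    java -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
      -classpath my-druid-fat.jar:/path/to/hadoop/conf \
      io.druid.cli.Main index hadoop hadoop_index_spec.json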
However, when I ran it on a 700 MB file for an hourly segment, it failed to finish.
I'm trying to understand what resources are necessary for this batch job to finish.
Do I need a bigger Overlord? Several Overlords? Several MiddleManagers?
Also, I pulled the data out of Hadoop with the hadoop fs -text command and ran the regular batch indexing task on the resulting JSON text file.
Again, it was 700 MB compressed (about 3 GB decompressed), and it loaded in about an hour.
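Concretely, the steps were roughly along these lines (hostnames, paths, and file names below are placeholders; index_task.json holds the regular "index" task spec):

    # decompress the raw data out of HDFS into a local JSON text file
    hadoop fs -text /path/on/hdfs/events.gz > events.json

    # submit a regular (non-Hadoop) index task to the Overlord
    curl -X POST -H 'Content-Type: application/json' \
         -d @index_task.json \
         http://overlord-host:8090/druid/indexer/v1/task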
Is that because the regular batch index task is not a distributed process, while the HadoopDruidIndexer runs the work as a distributed job on the Hadoop cluster?
Please enlighten me.