Druid batch indexing of a large volume of data is slow.

I’m using the standard index task to index quite a large amount of data.

My current setup is:

1 Middle Manager with MAX_DIRECT_MEMORY: 3g,

with up to 4 Peons, each with MAX_HEAP: 1g, MIN_HEAP: 256m, MAX_DIRECT_MEMORY: 3g.

I’m inserting data 1 day at a time. Processing one day of data takes about 30 seconds on a Peon, and I have data going back ~25 years. Once it’s all in Druid it shows as ~2.5 GB.

Back of the envelope, that’s ~25 years × 365 days × 30 s ≈ 274,000 s, or roughly 3 days to insert all the data. (This is consistent with my testing; it takes ages!)

Is this a normal speed for the index task to operate at? Perhaps it’s not configured correctly? Would switching to hadoop_index likely improve things? What speed increase does Hadoop bring over the plain index task?

Any advice / suggestions welcome!

Thanks,

Richard

Hi Richard,
3 days for 2.5 GB of data is far too long.

I guess you might be using a low segment granularity, e.g. DAY, which generates a large number of very small segments.

If that is the case, changing the granularity to YEAR will help.
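For reference, segment granularity is set in the task's `granularitySpec`. A minimal sketch of the relevant part of an index task spec (the datasource name and intervals here are placeholders, not from the original thread):

```json
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "my_datasource",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "YEAR",
        "queryGranularity": "DAY",
        "intervals": ["1993-01-01/2018-01-01"]
      }
    }
  }
}
```

With `segmentGranularity: YEAR`, 25 years of data produce on the order of 25 segments instead of ~9,000 daily ones, while `queryGranularity: DAY` keeps rows rolled up to day resolution.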

Also, the IndexTask is not optimal for high-volume data workloads and does not cope well with generating a large number of segments.

Using the Hadoop index task will also help reduce indexing time further.

The hadoop indexer is definitely the best way to index a large amount of batch data. It parallelizes better than the vanilla index task and also has had much more work put into optimizing it. The vanilla index task is mostly intended for small batch loads.
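To switch over, the task type changes to `index_hadoop` and the spec gains Hadoop-specific `ioConfig` and `tuningConfig` sections. A rough sketch, with placeholder datasource name and input paths (not taken from the thread):

```json
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "my_datasource",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "YEAR",
        "queryGranularity": "DAY"
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "hdfs://namenode/path/to/data/*.json"
      }
    },
    "tuningConfig": {
      "type": "hadoop"
    }
  }
}
```

The Hadoop indexer splits the work into map/reduce stages, so ingestion parallelizes across the cluster rather than being bound to a single Peon per interval.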

Thanks guys, I’m getting this working now!

Got it working.

The Hadoop indexer is a lot faster on the same hardware (about 10x faster for me).