Speed up Hadoop ingestion for a single hour

Hello,

We are running the Hadoop-based ingestion locally (no MapReduce cluster) because we found that was the most straightforward way to get Parquet into Druid with our version (Druid 10).

We currently generate 8 or so files into S3 every hour and load them up. We notice the MiddleManager has one thread that is highly CPU-bound loading the entire hour. Loading an hour of data takes 1 hour and 20 minutes for us.

I have contemplated a few ways to solve this. In one scenario, we could have our processes generate files for 10-minute windows and launch 6 processes at once.

This seems backwards; is there any way to speed up the load instead?

Thanks,

Edward

We currently generate 8 or so files into S3 every hour and load them up. We notice the MiddleManager has one thread that is highly CPU-bound loading the entire hour. Loading an hour of data takes 1 hour and 20 minutes for us.

The MiddleManager itself doesn't really do much work for ingestion; the bulk of the work is performed by the worker process for the ingestion task, so you could try increasing the memory allocated to the worker processes (druid.indexer.runner.javaOpts).
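For reference, that property goes in the MiddleManager's runtime.properties. Something like the following, where the heap and direct-memory sizes are purely illustrative and not tuned recommendations:

```
# MiddleManager runtime.properties
# JVM options passed to each worker (peon) process launched for an ingestion task.
# Sizes below are examples only; tune them for your hardware and workload.
druid.indexer.runner.javaOpts=-server -Xmx4g -XX:MaxDirectMemorySize=4g -Duser.timezone=UTC -Dfile.encoding=UTF-8
```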

I haven’t tried this myself, but maybe you could try increasing parallelism in the LocalJobRunner by setting properties like mapreduce.local.map.tasks.maximum or mapreduce.local.reduce.tasks.maximum, or other related Hadoop settings.
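Those properties would go into the jobProperties map in the tuningConfig of your Hadoop index task spec. A sketch of the relevant fragment, again untested on my side; the thread counts are just placeholders to experiment with:

```
"tuningConfig": {
  "type": "hadoop",
  "jobProperties": {
    "mapreduce.framework.name": "local",
    "mapreduce.local.map.tasks.maximum": "4",
    "mapreduce.local.reduce.tasks.maximum": "4"
  }
}
```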

If it’s feasible, I think making segmentGranularity coarser can also help speed up Hadoop indexing tasks.
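That knob lives in the granularitySpec of the ingestion spec. A minimal sketch showing DAY as one example of a coarser setting than HOUR; the interval here is just a placeholder:

```
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": "NONE",
  "intervals": ["2018-01-01/2018-01-02"]
}
```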

I have contemplated a few ways to solve this. In one scenario, we could have our processes generate files for 10-minute windows and launch 6 processes at once.

That would work; you could run a compaction job later on if you wish to consolidate the segments created by the separate tasks.
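If it helps, here is a minimal sketch of such a consolidation task, assuming a Druid version that includes the compact task type; the dataSource and interval are placeholders. On versions without it, a re-indexing task over the same interval would serve the same purpose.

```
{
  "type": "compact",
  "dataSource": "my_datasource",
  "interval": "2018-01-01T00:00:00/2018-01-02T00:00:00"
}
```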

Thanks,

Jon