We recently had an increase in the volume of events coming into our Druid ingestion stream, which led to an increase in our batch indexing time, even though our segment size remained the same. Looking at the logs, the majority of this time is spent in the pre-aggregation phase. The number of reducers for the IndexGeneratorJob is always 1; from reading the code, it looks like the reducer count is determined by the segment granularity (if our segment granularity is one day and we run a daily batch indexing job, there will be only one reducer). Please let me know if this assumption is incorrect.
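To make my assumption concrete, here is a rough sketch (in Python, with a hypothetical helper name, not actual Druid code) of how I understand the reducer count to fall out of the segment granularity:

```python
from datetime import datetime, timedelta

def segment_buckets(interval_start, interval_end, granularity):
    """Count the segment-granularity buckets covered by an indexing interval.
    My reading of the code is that the IndexGeneratorJob gets roughly one
    reducer per bucket (hypothetical illustration, not Druid's actual code)."""
    buckets = 0
    t = interval_start
    while t < interval_end:
        buckets += 1
        t += granularity
    return buckets

# A daily batch job with DAY segment granularity covers one bucket,
# hence a single reducer:
print(segment_buckets(datetime(2015, 9, 1), datetime(2015, 9, 2),
                      timedelta(days=1)))  # prints 1
```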
I’d really like to reduce the time this batch indexing takes. One solution I can employ on my side is to pre-process the Druid row-generated JSON data, essentially doing the pre-aggregation in a separate MapReduce job before handing off to the Druid Hadoop index task (it seems to me there is no limitation forcing this pre-aggregation to run on a single reducer). But it seems odd not to be able to use Druid’s built-in pre-aggregation. Is there a reason it isn’t written to run as a separate MR job that could take advantage of more parallelism? Has anyone else encountered this issue, and if so, what solution did you employ to improve batch indexing time?
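For clarity, this is the rollup I have in mind for the pre-processing job, sketched here in plain Python rather than as an actual Hadoop job (field names are hypothetical): rows are keyed by (truncated timestamp, dimension values) and metrics are summed, which is the same rollup Druid performs at ingest but expressed as an ordinary reduce that could be spread over many reducers.

```python
from collections import defaultdict

def pre_aggregate(rows, dims, metrics):
    """Roll up raw rows by (timestamp, dimension values), summing metrics.
    Assumes timestamps are already truncated to the query granularity."""
    agg = defaultdict(lambda: defaultdict(int))
    for row in rows:
        key = (row["timestamp"],) + tuple(row[d] for d in dims)
        for m in metrics:
            agg[key][m] += row[m]
    return {k: dict(v) for k, v in agg.items()}

# Hypothetical event rows; the two "page a" rows collapse into one.
rows = [
    {"timestamp": "2015-09-01T00", "page": "a", "clicks": 1},
    {"timestamp": "2015-09-01T00", "page": "a", "clicks": 2},
    {"timestamp": "2015-09-01T00", "page": "b", "clicks": 5},
]
result = pre_aggregate(rows, ["page"], ["clicks"])
print(result)
```

Since the key includes the dimension values, the aggregation partitions cleanly across reducers instead of funneling through one.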
We are running Druid 0.8.0.