Hadoop batch ingestion number of input splits


I have recently started using Hadoop-based batch ingestion on Druid and ran into an issue. What I want to understand is how Druid determines how many input splits to create for an MR job. Is there any setting, in the ingestion spec or elsewhere, through which I can control the number of input splits and mapper tasks?

I see that this


talks about maxSplitSize; however, there is no mention of how to use it for normal batch indexing using Hadoop.

I am trying to ingest 306 relatively small JSON files, no more than 1 MB each, and the ingestion takes more than 30 minutes. I don’t have any high-cardinality dimensions or uniques. All I see is that Druid is generating 306 input splits and launching 306 mapper tasks, which could be what is taking so much time. So I want to understand how the creation of input splits is controlled. Can somebody please shed some light on this?


Hi Vijay,

By default Druid uses the regular Hadoop TextInputFormat, which IIRC creates one split per file when the files are this small, so 306 files means 306 mappers. You can also set Druid’s “combineText” option to switch to a combining input format, which produces fewer splits by packing several files into each split.
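
If it helps, here is a rough sketch of turning that option on. It assumes “combineText” lives in the tuningConfig section of an index_hadoop task spec, which is my recollection of the docs, so please check it against your Druid version. The helper name with_combine_text and the stub spec are made up for illustration; the snippet only edits the JSON in memory and does not talk to Druid.

```python
import json


def with_combine_text(spec):
    """Return a copy of a Hadoop ingestion spec with combineText enabled.

    Assumption: combineText belongs under spec.tuningConfig.
    """
    updated = json.loads(json.dumps(spec))  # cheap deep copy of plain JSON data
    tuning = updated.setdefault("spec", {}).setdefault("tuningConfig", {})
    tuning.setdefault("type", "hadoop")
    tuning["combineText"] = True
    return updated


# Hypothetical, heavily trimmed task spec; a real one also needs
# dataSchema and ioConfig sections.
spec = {"type": "index_hadoop", "spec": {"tuningConfig": {"type": "hadoop"}}}

updated = with_combine_text(spec)
print(json.dumps(updated["spec"]["tuningConfig"], indent=2))
```

You would then submit the updated spec the same way you submit the current one. How many files end up in each combined split depends on the split size settings of the combining format, which I have not checked, so compare the mapper count before and after.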