Batch Ingestion - Avoid "determine_partitions"

I am indexing a large dataset using Hadoop-based batch ingestion.

My segment granularity is HOUR, and I know that even a single hour will contain a large number of rows.

So, I split the files based on a row limit (1 million rows per file).

When I run the batch ingestion task, the default behaviour is to combine all of these files (for the same hour) into a single segment.

This segment would be very large and would increase query latency.

So, I added a partitionSpec to limit the number of rows per segment and create hashed shards.

The question is: can I instruct Druid to create one shard per input file? That way, we could skip the “determine_partitions” job, which effectively makes the batch ingestion job take twice as long.

You could achieve something similar by setting “numShards” equal to the number of files you have. The segments won’t correspond 1-1 with files (there will still be a shuffle phase) but that’ll skip the determine partitions job.
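As a rough sketch of that suggestion, the snippet below builds the partitionSpec portion of a Hadoop tuningConfig with a fixed shard count. The helper name hashed_partition_spec and the file count of 12 are made up for illustration, and the exact field names should be checked against the docs for your Druid version.

```python
import json


def hashed_partition_spec(num_input_files: int) -> dict:
    """Build a hashed partitionSpec with an explicit shard count.

    Hypothetical helper: the shard count is simply the number of input files.
    """
    if num_input_files < 1:
        raise ValueError("need at least one input file")
    # An explicit numShards replaces a row-based target, so no separate
    # pass over the data is needed to pick the number of shards.
    return {"type": "hashed", "numShards": num_input_files}


# Assumed example: 12 input files of about 1 million rows each.
tuning_config = {
    "type": "hadoop",
    "partitionSpec": hashed_partition_spec(12),
}

print(json.dumps(tuning_config, indent=2))
```

As far as I know, the same numShards value is used for every hourly interval covered by the job, so this fits best when each hour has roughly the same number of files.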