I am working with a large dataset to index using Hadoop based batch ingestion.
My segment granularity is HOUR, and I know that even within an hour there would be a large number of rows.
So, I split the input files based on a row-count limit (1 million rows per file).
When I run the batch ingestion task, the default behaviour is to combine all these files (of the same hour) into a single segment.
This segment would be very large and would cause query latencies.
So, I set a partitionsSpec to limit the number of rows per segment and create hashed shards.
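For reference, the spec I'm using looks roughly like this (field names follow the Druid Hadoop indexing docs; the target size is just an example value):

```json
"partitionsSpec": {
  "type": "hashed",
  "targetPartitionSize": 1000000
}
```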
The question is: can I instruct Druid to create one shard per input file? That way we could skip the "determine_partitions" job, which effectively makes the batch-ingestion job run twice as slow.
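I'm aware that specifying an explicit shard count skips the determine-partitions pass, something like the sketch below, but that fixes the number of shards per hour up front rather than mapping one shard to each input file:

```json
"partitionsSpec": {
  "type": "hashed",
  "numShards": 10
}
```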