I have some questions regarding hash-based partitioning and its impact on query performance.
I am running Hadoop ingestion with hash-based partitioning for my data. The total size of the data to be ingested varies between 15 - 28 GB and the segment size varies from 1.5 - 3 GB(about 9-10 million rows). When I set the targetPartitionSize to 5,00,000 rows. the number of shards being created is between 19-21 and running time for ingestion is 8-13 mins.
So my questions are :
How do I figure out the optimal shard number for my scenario such that the ingestion task takes minimum time to run?
What is the correlation between increasing the number of shards and query latency?
How do you figure out a balance between a low-enough running time for Hadoop ingestion as well as an acceptable query latency?