We have one month of data at a segment volume (on the historicals) of 3.5 GB per hour, which adds up to roughly 2.5 TB of segments for the month. Our segment granularity and query granularity are both HOUR, and we have 6 partitions per hour, so roughly 4,300 partitions per month.
We serve this data using a cluster of historicals that have a total of 480 CPU cores and 3 TB of RAM available for memory mapping.
We have currently set the replication level to 2, so each segment is loaded on two different historicals and the total data volume increases to 5 TB.
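To make the numbers concrete, here is a rough sketch of the arithmetic in Python. It assumes a 30-day month and decimal units (1 TB = 1000 GB); the function and variable names are my own and not anything from Druid.

```python
# Back-of-the-envelope sizing from the figures in this post.
GB_PER_HOUR = 3.5
HOURS_PER_MONTH = 24 * 30      # assumption: a 30-day month
PARTITIONS_PER_HOUR = 6
RAM_TB = 3.0                   # RAM available for memory mapping


def sizing(replicas_on_tier):
    """Return the data volume held on one tier and whether it fits in RAM."""
    segments_tb = GB_PER_HOUR * HOURS_PER_MONTH / 1000  # decimal TB
    on_tier_tb = segments_tb * replicas_on_tier
    return {
        "partitions": PARTITIONS_PER_HOUR * HOURS_PER_MONTH,
        "segments_tb": segments_tb,
        "on_tier_tb": on_tier_tb,
        "fits_in_ram": on_tier_tb <= RAM_TB,
    }


same_tier = sizing(replicas_on_tier=2)   # both copies on the same tier
split_tier = sizing(replicas_on_tier=1)  # second copy lives on another tier
print(same_tier)
print(split_tier)
```

With both copies on the same tier the volume is well above the 3 TB of RAM, while a single copy per tier stays below it.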
I’m wondering what the tradeoffs are between replicating on the same tier and keeping the replicas on a separate tier with a lower memory-to-disk ratio.
I personally believe that putting the replicas onto a different tier ought to increase query performance because the 2.5 TB of data would then fit completely in RAM.
Same-tier replicas also don’t play well with the local cache and memory mapping. If we submit the same query multiple times, the second run should be served quickly from cache. But for each segment partition scanned by a query, the broker picks either the original or the replica, so each run ends up with a different set of historicals. It therefore takes many rounds of the same query until all segments and their replicas have been mapped into memory and are available in the local query cache.
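The warm-up effect can be illustrated with a small simulation. This is only a sketch under a simplifying assumption of mine: for every segment and every run of the query, the broker picks one of the copies uniformly at random and independently of earlier runs. The function name and the model are hypothetical and do not reflect Druid internals.

```python
import random


def rounds_until_warm(num_segments, replicas, rng):
    """Count how often the same query must run until every copy of every
    segment has been picked (and therefore warmed) at least once."""
    # For each segment, track which copies have not been touched yet.
    cold = [set(range(replicas)) for _ in range(num_segments)]
    remaining = num_segments * replicas
    rounds = 0
    while remaining > 0:
        rounds += 1
        for copies in cold:
            pick = rng.randrange(replicas)  # assumed uniform random choice
            if pick in copies:
                copies.remove(pick)
                remaining -= 1
    return rounds


rng = random.Random(42)
print("1 copy per tier:  ", rounds_until_warm(4320, 1, rng))
print("2 copies per tier:", rounds_until_warm(4320, 2, rng))
```

Under this model a single copy is fully warm after one run, whereas two copies need several repetitions before the last cold copy is finally hit.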
Therefore I believe that it might be beneficial to put the replicas into another tier.
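As far as I understand, the split-tier setup could be expressed as a coordinator load rule with one replicant per tier. The sketch below just builds that rule as JSON; the tier name `hot` is hypothetical (the historicals on that tier would need `druid.server.tier=hot`), while `_default_tier` is the tier name Druid uses when none is configured.

```python
import json

# Sketch of a retention rule: keep one copy on the RAM-heavy tier and one
# copy on the default tier. "hot" is a hypothetical tier name.
load_rules = [
    {
        "type": "loadForever",
        "tieredReplicants": {
            "hot": 1,            # copy intended to fit entirely in RAM
            "_default_tier": 1,  # copy on the disk-heavy tier
        },
    }
]

rules_json = json.dumps(load_rules, indent=2)
print(rules_json)
```

For the broker to prefer the RAM-heavy tier, I assume its historicals would also need a higher `druid.server.priority` than the other tier.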
On the other hand, if a segment is replicated within the same tier, there are more CPU cores that can potentially serve a query touching that segment. So it seems to me that same-tier replication makes sense when an extremely high number of concurrent queries is expected to hit the cluster.
Could someone provide some insight into these matters and tell me whether my assumptions are right or wrong?
What would be the three most typical setups?
How many users would have to be using a Druid cluster to justify replicating on the same tier?