Need advice on replication strategy

We have one month of data at a segment volume (on the historicals) of 3.5 GB per hour, which adds up to roughly 2.5 TB of segments for the month. We use a segment and query granularity of HOUR with 6 partitions per hour, so roughly 4,300 partitions per month.
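As a quick sanity check on those figures, here is the back-of-envelope arithmetic (assuming a 30-day month):

```python
# Back-of-envelope check of the segment volume and partition count above.
gb_per_hour = 3.5
partitions_per_hour = 6
hours_per_month = 24 * 30  # assuming a 30-day month

total_gb = gb_per_hour * hours_per_month                  # 2520 GB, i.e. ~2.5 TB
total_partitions = partitions_per_hour * hours_per_month  # 4320 partitions

print(f"{total_gb} GB per month, {total_partitions} partitions")
```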

We serve this data using a cluster of historicals that have a total of 480 CPU cores and 3TB of RAM available for memory mapping.

We currently have the replication level set to 2, so each segment gets distributed to two different historicals and the total data volume therefore increases to 5 TB.
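For reference, a replication level of 2 within a single tier corresponds to a coordinator load rule along these lines (`_default_tier` is Druid's default tier name; substitute whatever tier name is configured on the historicals):

```json
{
  "type": "loadForever",
  "tieredReplicants": {
    "_default_tier": 2
  }
}
```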

I’m wondering what the tradeoffs are between replicating within the same tier and keeping the replicas on a separate tier with a lower memory-to-disk ratio.

I personally believe that putting the replicas onto a different tier ought to increase query performance, because the 2.5 TB of data would then fit completely in RAM.

Same-tier replicas also don’t play well with the local cache and memory mapping. If we submit the same query multiple times, we would expect the second run to be served quickly from cache. But since the broker picks either the original or the replica for each segment partition to be scanned, each run ends up with a different set of historicals serving the query, so it takes many rounds of sending the same query until all segments and their replicas have been mapped into memory and are also available in the local query cache.

Therefore I believe that it might be beneficial to put the replicas into another tier.
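If I understand the load rules correctly, splitting the two replicas across two tiers would be expressed roughly like this (the tier names `hot` and `replica` are just placeholders for illustration):

```json
{
  "type": "loadForever",
  "tieredReplicants": {
    "hot": 1,
    "replica": 1
  }
}
```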

On the other hand, if a segment is replicated within the same tier, then more CPU cores can potentially serve a query for that segment. So it seems to me that replicating a segment within the same tier makes sense when an extremely high number of concurrent queries is expected to hit the cluster.

Could someone provide some insight into these matters and tell me whether my assumptions are right or wrong?

What would be the three most typical setups?

How many users would have to be using a Druid cluster to justify replicating on the same tier?



Hey Sascha,

It sounds like you are generally thinking about things in the right way.

If you have a himem/lomem split like that, then usually the reasons to keep multiple replicas on himem are because you have really high query concurrency for some segments (CPU bound cluster rather than memory bound) or because you want smoother fault tolerance. With only one replica in himem, if one of your himem nodes fails (or is even just being rebooted), then all queries for those segments will be routed to lomem nodes for a time.
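(For anyone setting up such a split: the tier membership is declared per historical in its runtime.properties, e.g. as below. The tier names and priority values are just examples; by default the broker routes queries to the highest-priority servers that have a given segment, via `druid.broker.select.tier=highestPriority`.)

```properties
# On each himem historical (example values):
druid.server.tier=himem
druid.server.priority=10

# On each lomem historical:
druid.server.tier=lomem
druid.server.priority=0
```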

On the flip side it’s usually not worth it to increase replication in himem if that means you have to substantially lower the ratio of segments that you can keep in the page cache.

Thanks for the explanation, Gian.

I have a follow-up question:

We now have a hot tier with r3.8xlarge nodes serving one month of data and are setting up a cold tier with i2.8xlarge nodes for serving up another 2 or more months of data.

Currently we have an in-tier replication level of 2 for both tiers.
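Concretely, my understanding is that our current rule set looks roughly like this (a sketch; `hot` and `cold` stand in for our actual tier names):

```json
[
  {
    "type": "loadByPeriod",
    "period": "P1M",
    "tieredReplicants": { "hot": 2 }
  },
  {
    "type": "loadForever",
    "tieredReplicants": { "cold": 2 }
  }
]
```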

Given that most users tend to query the last month of data, the cold tier’s CPU resources would not be utilized much; they’d just sit idle until a user happens to query older data. Is there a way to make better use of the cold tier?

I was wondering whether there is a way to configure the cold tier to host the replicas of the hot tier in such a way that, if a given query needed to scan more segments than there are CPU cores in the hot tier, the cold tier would “help out” with scanning the other segments. Is such a setup doable, practical, or even recommended?

I guess such a setup would ideally require Druid to have some sort of cost planner: if the number of segments to be scanned for a query meant that the hot tier’s CPUs had to take 10 sequential passes, then it might make sense to let the cold tier participate in the scanning, provided the disk-to-memory mapping could be assumed to take less time than 10 sequential in-memory scans on the hot tier.

If something like that were configurable, it might reduce the query latency for long-period queries, wouldn’t it?