Recommended number of Historicals for a large data set

Hi,

We have a datasource whose size is around 8 TB, and our Druid cluster has 10 Historical nodes of type i3.8xlarge (AWS EC2 instance). What would be the recommended number of Historical nodes of this EC2 type for the 8 TB data set? Is there a formula to determine the number of Historical nodes needed?

regards,

Vinay

Hi Vinay,
It looks like AWS i3.8xlarge gives you almost 7.6 TB of space per node.

What is the replication you are planning for your data sources? Do you want to keep 1 copy or 2 copies?

How many data sources do you have, and what is the size of each?

What is your ingestion rate for each data source on a daily basis?

Of course, you need to keep some additional space for segment balancing when one or two nodes go down.

You need to consider all of the above factors before deciding on the number of Historicals.
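As a rough back-of-the-envelope, those factors fold into a simple node-count estimate. Below is a minimal Python sketch; the helper name and the target-utilization parameter are my own illustrative choices, and the example numbers (replication 2, roughly 6 TB of the 7.6 TB disk reserved for the segment cache) are assumptions you would replace with your real values.

```python
import math

def estimate_historicals(data_tb, replication, usable_tb_per_node, target_utilization):
    """Rough Historical count: segment bytes to be loaded, divided by the space
    you actually want occupied per node (leaving headroom for rebalancing,
    reindexing, and future growth)."""
    loaded_tb = data_tb * replication
    return math.ceil(loaded_tb / (usable_tb_per_node * target_utilization))

# e.g. 8 TB of segments, 2 replicas, ~6 TB of segment cache per i3.8xlarge:
print(estimate_historicals(8, 2, 6, 0.5))   # aim to run ~50% full  -> 6 nodes
print(estimate_historicals(8, 2, 6, 0.2))   # start only ~20% full -> 14 nodes
```

The second call corresponds to the roughly-20%-full starting point I describe below.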

In my experience, I have seen a lot of companies start with 15 to 20 percent of their storage used and roughly 80 percent free for processing and for future data.

It might seem like over-provisioning, but clusters fill up pretty fast in real production environments.

I am not sure about your budget constraints, but if you have some room there, I would definitely go for a bit of over-provisioning (so that when you start your brand-new cluster it is about 20% full, with 80% still available for future data and other processing) and then not have to think about it for at least two years.

That way, you can focus on the actual business use case you want to accomplish with Druid rather than getting distracted by infrastructure issues every six months.

Hope this is useful.

Thank you.

–siva

Hi Siva,

Thank you for the detailed information. Below are the additional details. Please let me know if the Historical configuration below needs to be tweaked for better performance.

  • We have 2 data sources in Druid (let’s call them A and B); data source A is 4 TB and data source B is 7.5 TB.

  • The average size of the segments created per day after ingestion is 50 GB for data source A and 35 GB for data source B.

  • Replication factor is 2

  • Currently, for the i3.8xlarge Historicals, the property druid.segmentCache.locations has maxSize set to 6 TB. With this configuration, the Coordinator console shows 60 TB of storage space for segments across the 10 Historicals, of which 22.5 TB is currently used, i.e. about 38% of each Historical’s space.
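For reference, a quick check of those Coordinator figures (just my own arithmetic in Python, not Druid output):

```python
nodes = 10
max_size_tb = 6        # druid.segmentCache.locations maxSize per Historical
used_tb = 22.5

total_tb = nodes * max_size_tb        # 60 TB of segment cache across the tier
used_pct = 100 * used_tb / total_tb   # same fraction per node if segments are evenly balanced
print(total_tb, round(used_pct, 1))   # 60 37.5  (the ~38% shown on the console)
```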

With this ingestion rate, existing data size, and replication factor, how many Historical nodes would I need to maintain for good query performance?

Regards,

Vinay

Hi Vinay,
We still have roughly 37 TB of space left (60 TB total minus the 22.5 TB already used).

We are getting roughly 100 GB of data daily and replication is 2, so we are occupying about 200 GB per day -> 2 TB every 10 days -> in roughly another 180 days we will fill up the remaining 37 TB just for storage.
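Spelled out as a tiny Python sketch (same numbers as above, give or take rounding):

```python
total_tb = 10 * 6                  # 60 TB of segment cache on 10 Historicals
remaining_tb = total_tb - 22.5     # ~37.5 TB still free
daily_growth_tb = 0.1 * 2          # 100 GB/day ingested x replication factor 2

print(round(remaining_tb / daily_growth_tb))   # ~188 days, in line with the rough 180-day figure
```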

Now let’s say we double your capacity (from 10 to 20 Historicals).

Let’s work out how that looks.

We get 10 more Historicals, which is another 60 TB, plus the current remaining 37 TB -> 97 TB of total available storage after doubling from 10 to 20 Historicals.

At the current rate (200 GB per day), one year from now we will have filled roughly 73 TB of the total 97 TB, which is about 75% of the space, leaving us roughly 25% for other processing plus new data by the end of the year.
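And the one-year projection for the doubled cluster, written out the same way (still just a sketch, reusing your 6 TB-per-node maxSize):

```python
free_now_tb = 37.5                   # space left on the current 10 Historicals
added_tb = 10 * 6                    # 10 new Historicals at 6 TB of segment cache each
total_available_tb = free_now_tb + added_tb    # ~97 TB after doubling to 20 nodes

one_year_growth_tb = 0.2 * 365       # ~73 TB at 200 GB/day with replication
print(round(100 * one_year_growth_tb / total_available_tb))   # ~75% used, ~25% headroom
```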

This assumes that you are NOT marking your old segments UNUSED and are keeping all of your segments USED.

Based on your ingestion rate, your current 22.5 TB (roughly 11 TB without replication) probably corresponds to the last 110 to 120 days (at 200 GB per day with replication), i.e. roughly 4 months.

Again, depending on your budget: if you want to avoid worrying about infrastructure issues for at least a year and also get good query and ingestion performance, the scenario above looks good to me.

I have one doubt here: when you say roughly 100 GB per day, I assumed that is without replication.

But if my assumption is wrong and it is roughly 100 GB per day with replication, then to stay in good shape for at least a year, even 5 more Historicals should be good enough.

If my assumption is correct and it is roughly 100 GB per day without replication, then to stay in good shape for at least a year, doubling your Historicals should be good enough.
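To make the two scenarios concrete, here is the same calculation for both readings of the 100 GB/day figure (a sketch only; the helper and its defaults are mine, with 6 TB of usable cache per node as above):

```python
def days_until_full(extra_nodes, daily_growth_tb, free_tb_now=37.5, tb_per_node=6):
    """Days of runway after adding extra_nodes, at a given daily growth (with replication)."""
    return (free_tb_now + extra_nodes * tb_per_node) / daily_growth_tb

# If 100 GB/day already includes replication -> growth is 0.1 TB/day:
print(round(days_until_full(5, 0.1)))    # ~675 days with only 5 extra Historicals

# If 100 GB/day is before replication -> growth is 0.2 TB/day:
print(round(days_until_full(10, 0.2)))   # ~488 days after doubling to 20 Historicals
```

Either way you end up with comfortably more than a year of runway.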

Hope this helps. Please let me know if I went off track somewhere or missed an important point.

Thank you.

–siva

Thank you, Siva, for the detailed explanation. Your understanding is correct: the daily ingested data size is around 100 GB without replication.

Regards,

Vinay