Understanding historical node disk space requirements

Hi all,

I have been experimenting with Druid lately and read something like the following:

**"Historical Nodes load all of the data in deep storage into their local cache to serve queries!"**

Does this mean that if I load a new dataset into Druid, say around 3TB in size, the historical node will load the whole 3TB onto its disk?

Which would mean I need historical nodes with that much disk space?

Any suggestions/thoughts here?



Hey Anoosha,

A Druid historical node loads data based on the load rules you specify in the coordinator. If your cluster has more than one historical node, data distribution is again controlled by the coordinator: each historical node loads the segments listed under its /loadqueue path (populated by the coordinator).

Specifically, if your load rules require loading the entire 3TB of data, then yes, it will try to load all of it, and loading will fail if you run out of disk space.
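For reference, load rules are submitted to the coordinator as a JSON rule chain. A minimal sketch that keeps everything loaded might look like this (the tier name and replicant count here are just example values):

```json
[
  {
    "type": "loadForever",
    "tieredReplicants": { "_default_tier": 2 }
  }
]
```

With a rule like this, historicals in the default tier will collectively try to hold two replicas of every segment, so the tier's total disk capacity has to cover roughly twice the dataset size.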



A noob question here: what if my load rule is interval-based, and I specify an interval such that, say, only 100GB of data out of 1TB gets loaded?

Then what about the remaining data? Is it not available for query? Does this mean only 100GB of data is available in the historical nodes to query, and if I query anything outside that interval, Druid will reach out to deep storage to fetch the data?



Druid doesn't pull data from deep storage in the query path. Segments that the load rules exclude are not served by historicals, so they are not queryable until the rules are changed.
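To illustrate the interval case asked about above, a rule chain can load only a specific interval and drop everything else. This is a sketch with placeholder dates and tier settings:

```json
[
  {
    "type": "loadByInterval",
    "interval": "2019-01-01/2019-02-01",
    "tieredReplicants": { "_default_tier": 1 }
  },
  { "type": "dropForever" }
]
```

Rules are evaluated top to bottom: segments within the interval are loaded onto historicals, and everything else matches the dropForever rule and is unloaded. The dropped segments still exist in deep storage, but Druid will not fetch them at query time.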

So what does that mean? If I have 1TB of data loaded and I query 500GB worth of data at once, will the historical node load the whole 500GB onto its disk?



Historical processes do not load data from deep storage when responding to a query; instead, they read pre-fetched segments from their local cache/disks. I recommend reading the Druid architecture overview at http://druid.io/docs/latest/design/index.html, which will help you understand things better.
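The size of that local segment cache is bounded by the historical's own configuration, which is what determines how much data a single node can serve. A sketch of the relevant runtime.properties settings (the path and sizes below are example values, not recommendations):

```properties
# Where this historical stores pre-fetched segments, and how many bytes
# it may use at that location (example values only)
druid.segmentCache.locations=[{"path":"/mnt/druid/segment-cache","maxSize":300000000000}]

# Total bytes of segments this historical announces it can serve
druid.server.maxSize=300000000000
```

So for the earlier 3TB question: the combined druid.server.maxSize across all historicals in a tier (times the replication factor) is what needs to accommodate the data your load rules assign to that tier.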