Does all your data need to fit within your historical nodes? (historical vs deep storage)

To put it another way, is it possible in Druid to have deep storage be much larger than the total sum of historical node storage? How is this done if so?

I was operating under the assumption that deep storage could be much larger than what you have for historical nodes, and that the broker/historical nodes would work together to query the data correctly. Is this true? So if I have 5 historical nodes with a total of 1TB of storage, and the data I’m querying over is, say, 10TB, will Druid pull in the segments and then eject them as needed? The way I have it set up in my cluster, it seems that a historical node just loads some chunk of segments and then holds on to them…

Thanks!

Hey Ron,

You actually do have to have enough disk space on your historical nodes to cache all of your data. Druid pre-populates the segment cache on each node for performance reasons (queries don’t have to wait for data transfer from deep storage to Druid nodes).

Note that this is DISK space, not MEMORY. Druid memory-maps segments and will page them in off disk as it needs them.
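For reference, the segment cache size is bounded by the historical's configuration. A minimal sketch of the relevant historical runtime.properties (the path and sizes here are made-up example values):

```properties
# Where segments are cached on local disk, and how many bytes that location may hold
druid.segmentCache.locations=[{"path": "/mnt/druid/segment-cache", "maxSize": 200000000000}]

# Total bytes of segments this historical will announce it can serve
druid.server.maxSize=200000000000
```

The coordinator will only assign segments to a historical up to its advertised `druid.server.maxSize`, so the sum of these values across your historicals is effectively the cluster's queryable capacity.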

Deep storage only acts as a permanent backup in Druid and is not used in the query path.

Hmm. Are there ways to specify what segments to load into historical nodes from deep storage? It seems like an odd choice to require historical nodes to be able to hold all the data - isn’t deep storage just a plain old backup, then? Or is it that the historical nodes must hold all the data for the time range of the query being processed?

(and thanks for the quick responses!)

Ron

Hi,

You can configure “rules” to decide what gets loaded on historicals. Please see http://druid.io/docs/0.8.0/design/coordinator.html#rules
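As a sketch, a rule chain that keeps only the most recent month of a datasource on historicals and drops everything older (the period and tier name here are illustrative) might look like:

```json
[
  {
    "type": "loadByPeriod",
    "period": "P1M",
    "tieredReplicants": {
      "_default_tier": 2
    }
  },
  {
    "type": "dropForever"
  }
]
```

Rules are evaluated top to bottom for each segment, so segments within the last month get loaded with two replicas and anything that falls through to the second rule is dropped from historicals (it still remains safe in deep storage).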

– Himanshu

The purpose of deep storage is that you can quickly recover even if you lose your entire cluster. When Druid was first developed, it ran in AWS, which was fine most of the time, but occasionally you could lose half your cluster if Amazon was having a bad day. Having segments backed up in deep storage means that even if that happens, historicals can download the segments from deep storage and recover quickly.