So, I presented Druid to my team and they had a lot of questions, some of which I couldn’t find answers to.
Let’s imagine that we have a production cluster: several historical nodes, a coordinator node, a real-time node, and a broker node.
I understand the basic flow: the real-time node ingests data, creates segments, pushes them to deep storage (HDFS), and announces to
the other nodes that those segments are available.
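For reference, here is roughly how I understand deep storage to be wired up in configuration. These are the standard HDFS deep-storage properties from the Druid docs; the namenode address and path are placeholders I made up:

```properties
# Common runtime.properties (placeholder values) pointing all nodes at HDFS deep storage
druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://namenode:8020/druid/segments
```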
My first question is: does the coordinator ensure that all segments in HDFS are loaded onto historical nodes?
If so, what happens when you have more data in deep storage than you have space on the historical nodes? For instance,
say we have 1 TB of data in HDFS and 5 historical nodes, each of which can hold 100 GB (500 GB total). How does Druid handle this:
does it load as much as it can onto the historical nodes?
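To make the capacity numbers concrete, here is how I believe each historical node would advertise its 100 GB limit. `druid.server.maxSize` and `druid.segmentCache.locations` are real Druid historical settings; the cache path is a placeholder:

```properties
# Historical runtime.properties -- each node advertises 100 GB of segment capacity
druid.server.maxSize=107374182400
druid.segmentCache.locations=[{"path":"/mnt/druid/segmentCache","maxSize":107374182400}]
```

With five such nodes, the cluster can serve at most ~500 GB of the 1 TB sitting in HDFS, which is what prompts the question.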
My other question is about querying. I know the broker receives the query, but how does it know which historical nodes
to send it to? Does it ask the coordinator for this information, or does it simply send the query blindly to
all historical nodes?
Does a historical node ever load data from HDFS to service a query, or does it only load when the coordinator tells it to (i.e., when a new
segment is published to HDFS)?
Lastly, looking at Druid from a high level, I still don’t understand exactly why deep storage is needed. If the historical nodes
can be configured to replicate data (a mechanism that already exists), why can’t they simply hold all of the data themselves?