Realtime vs Historical performance

It looks like we'll be using Druid in our upcoming product, after I've spent the past six months doing proof-of-concept work on it. We're looking at some edge cases for deployment, and I'm wondering how the performance of a Realtime node compares to a Historical node on the same data. Are they basically the same if there's no ingestion happening? The edge case concerns a situation where we wouldn't want the effective doubling of storage needs that deep storage causes, and we're wondering what the ramifications of not using deep storage are. The other option is to remove segments from deep storage after a Historical node picks them up.

Thanks!

Any thoughts?

Because Realtime nodes do both writes (appending data) and reads, they work with a much smaller data volume than a Historical node (which does reads only). Realtime nodes are designed to periodically hand data off to Historicals. Deep storage is meant as a permanent backup of your data: it is what lets you recover your Druid cluster in the event of total cluster failure, or during data center migrations. Generally, storing segments in S3 or HDFS should not be expensive compared to the cost of running the cluster.
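For reference, deep storage is chosen in the common runtime properties, so switching it (or turning it off) is a configuration decision rather than a code change. A minimal sketch for S3; the bucket name and key prefix below are placeholders, and credentials can also come from the environment:

    # Use S3 for deep storage (HDFS and local disk are the other common choices).
    druid.storage.type=s3
    druid.storage.bucket=my-druid-bucket
    druid.storage.baseKey=druid/segments

    # S3 credentials (placeholders; an instance role can supply these instead).
    druid.s3.accessKey=<access-key>
    druid.s3.secretKey=<secret-key>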

Turns out two petabytes of data is a bit more expensive than one petabyte, so avoiding that duplication could be a big selling point for some of our customers. Like I said, it's an edge case, so I want to know the full ramifications of it.

Let's say you have a Realtime node that isn't ingesting any new data at all, just hosting, say, 100 GB of segments. Would it respond to queries at the same speed as a Historical node hosting the same 100 GB of segments?

Thanks!

Ron

Hi Ron,

It will be slower, as there are differences in the code paths. Realtime nodes are not designed to replace Historicals. You can colocate both node types on the same box if you want to avoid the cost of running dedicated Historical hardware.
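Colocating generally just means launching both services on the one machine, each with its own config directory and heap. A rough sketch, assuming the classic io.druid.cli.Main launcher; the classpaths and heap sizes here are illustrative, not recommendations:

    # Historical and Realtime on the same box, separate configs and ports.
    java -server -Xmx8g -classpath conf/druid/historical:lib/* io.druid.cli.Main server historical
    java -server -Xmx4g -classpath conf/druid/realtime:lib/* io.druid.cli.Main server realtime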

Hey Ron,

Realtime nodes are not really intended to run on their own forever without doing handoff to Historicals. I think the issues you're most likely to run into are:

  • There’s no way to drop data other than shutting the node down, removing some files, and starting it back up

  • You won’t have redundant storage in case of server failures

  • Realtime segments on disk are only merged when they're handed off to Historicals, so storage efficiency and query performance will suffer a bit from running with unmerged segments (see the sketch after this list)
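
On that last point: how many unmerged intermediate persists pile up on a Realtime node is governed by its tuningConfig. A sketch of the relevant fields with illustrative values; longer persist periods and higher row limits mean fewer, larger intermediate chunks, but they are still only merged into a single segment at handoff:

    "tuningConfig": {
      "type": "realtime",
      "maxRowsInMemory": 500000,
      "intermediatePersistPeriod": "PT10m",
      "windowPeriod": "PT10m",
      "basePersistDirectory": "/tmp/druid/realtime/basePersist"
    }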