Thank you for the response.
I was hoping that Druid could be configured to load data onto historical nodes on demand.
My use case is one where we will have > 10TB of data over a year and will accumulate year over year. Last month’s data (~ 1TB) needs to be available for querying on a fairly regular basis and hence needs to be loaded in advance on the historical nodes.
The rest of the data (> 10 TB) will be queried only once in a while. I was hoping that Druid would allow some sort of on-demand cache behavior for this data: the coordinator would tell historical nodes to load the relevant segments in response to an incoming query, and those segments would expire from the historical nodes’ segment cache after a configurable time (for example, the next time the coordinator applies its rules).
Given that this is not the case, what is the recommended deployment configuration for this sort of usage pattern?
Loading all the data onto the historical nodes’ local storage is not a good option given the cost of local storage vs. deep storage. It sounds like that is required, though? Tiering historical node clusters could help a bit but doesn’t quite address the issue.
Druid attempts to address this use case with tiers of historical nodes. Add a couple of cheap nodes with slow cheap storage for your really old data on a slow tier.
We have the same use case that Rahul mentioned. I do not see how a slow tier would help if the slow tier nodes have to store data locally and cannot rely on deep storage to retrieve segments on demand. Am I missing anything? Thanks for the help.
I was merely trying to point out that you can configure a slow tier with different hardware specs to serve less important data. You are right that you can’t answer queries directly from deep storage.
E.g., for Historicals in the slow tier, you could purchase the cheapest local disk type available.
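To make the tiering suggestion concrete, here is a minimal sketch of coordinator load rules that keep the last month on a fast tier and everything older on a slow tier. The tier names "hot" and "cold" and the replica counts are assumptions for illustration; each Historical would declare its tier with druid.server.tier in its runtime properties. Note that this still keeps every segment on local disk somewhere, so it lowers the cost per TB but does not give on-demand loading from deep storage.

```python
import json

# Hypothetical tier names; they must match druid.server.tier on the Historicals.
HOT_TIER = "hot"
COLD_TIER = "cold"


def build_retention_rules(hot_period="P1M", hot_replicas=2, cold_replicas=1):
    """Build coordinator load rules for one datasource.

    Rules are evaluated top to bottom and the first matching rule wins,
    so segments from the most recent period land on the hot tier and all
    older segments fall through to the cold tier.
    """
    return [
        {
            "type": "loadByPeriod",
            "period": hot_period,  # ISO 8601 period, here one month
            "tieredReplicants": {HOT_TIER: hot_replicas},
        },
        {
            "type": "loadForever",
            "tieredReplicants": {COLD_TIER: cold_replicas},
        },
    ]


if __name__ == "__main__":
    # Print the JSON body; it could then be posted to the coordinator's
    # rules endpoint for the datasource or entered in the web console.
    print(json.dumps(build_retention_rules(), indent=2))
```

With sizes like the ones in this thread, the hot tier would need local disk for roughly 1 TB times its replica count, and the cold tier for the remaining 10+ TB times its replica count.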
Thanks Kyle. I wonder if this is something Druid would like to have as a feature? We wouldn’t mind doing the dev work if it fits into the Druid product roadmap. It would be great to hear comments from other Druid committers / maintainers. Thanks.
IMO, this feature makes sense in theory but would be challenging for some reasons:
(1) Druid’s query operators are all implemented to work on memory-mapped data, and it would be a big effort to have them work on streams of data from deep storage.
(2) You could imagine instead, having Druid download the deep storage data locally, memory map it, query it, and then delete it, but this would probably have a lot of overhead.
(3) You could also imagine doing a cache for (2), which actually might work although there could be a lot of thrashing.
If one of these three approaches is an exciting feature for you I encourage bringing it up on the Druid dev list (email@example.com). You can subscribe by emailing firstname.lastname@example.org.
Got it, thanks. Does the metadata store keep track of segments on deep storage? I admit I don’t fully understand how data on deep storage is tracked and how indexing works in Druid, but a naive thought would be to keep track of the locations and indexes of segments stored on deep storage and load only the data a query needs to access. It would probably overload the metadata store, and might require a separate metadata store for deep storage. Please feel free to correct me if I got it completely wrong. Thanks.
Yes, the Druid metadata store keeps track of the segments in deep storage. However, having Druid query deep storage at query time would add a lot of overhead: a) reading the segment from deep storage, b) memory-mapping it, c) querying it, and d) deleting the local segment file. These steps would add a lot of latency to the query. As Gian pointed out, this implementation could be very challenging.
Yes, Druid keeps track of segment locations on deep storage using the metadata store. It could in theory use this to load data on demand. You could imagine putting Druid Historicals in a mode where they “swap” segments in and out of deep storage rather than downloading everything locally ahead of time. This would be like my suggestion (3) above. If it didn’t thrash too much it could work.
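To illustrate the thrashing concern, here is a toy simulation of a Historical that swaps segments in and out of a bounded local cache using LRU eviction. It is only a sketch under stated assumptions: SegmentCache and replay are hypothetical names, nothing here is Druid code, and the "download" is just bookkeeping standing in for fetching a segment from deep storage and memory-mapping it.

```python
from collections import OrderedDict


class SegmentCache:
    """Toy LRU cache of segments, bounded by total size in bytes."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.segments = OrderedDict()  # segment_id -> size, least recent first
        self.downloads = 0
        self.evictions = 0

    def access(self, segment_id, size_bytes):
        """Record a query touching a segment; return True on a cache hit."""
        if segment_id in self.segments:
            self.segments.move_to_end(segment_id)  # mark as most recently used
            return True
        if size_bytes > self.capacity_bytes:
            raise ValueError("segment is larger than the whole cache")
        # Evict least recently used segments until the new one fits.
        while self.used_bytes + size_bytes > self.capacity_bytes:
            _, evicted_size = self.segments.popitem(last=False)
            self.used_bytes -= evicted_size
            self.evictions += 1
        # Stand-in for downloading from deep storage and memory-mapping.
        self.segments[segment_id] = size_bytes
        self.used_bytes += size_bytes
        self.downloads += 1
        return False


def replay(cache, segment_ids, size_bytes=1):
    """Replay a list of segment accesses and return the hit ratio."""
    hits = sum(cache.access(s, size_bytes) for s in segment_ids)
    return hits / len(segment_ids)


if __name__ == "__main__":
    # The working set fits in the cache: repeat queries hit local disk.
    print(replay(SegmentCache(3), list("abcabc")))
    # The working set is one segment too large: every access re-downloads.
    print(replay(SegmentCache(3), list("abcdabcd")))
```

The second run is the thrashing case: a repeated scan over four equally sized segments with room for only three never gets a hit under LRU, so every query pays the full download cost.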
Btw, one reason we haven’t implemented this yet is the thinking that storage is cheap and compute is expensive, and most of the cost of running Druid Historicals is compute anyway. So it shouldn’t be too costly or burdensome to require enough disk to store any of the data you might want to query. This assumption might be invalid if there is a lot of data that you don’t normally query but still want to query from time to time.
If you want to continue this discussion I definitely suggest bringing it to the dev list. It’s the most appropriate forum for discussing future evolutions of Druid’s capabilities.