We have a use case where:
- a significant amount of data is generated every day
- most queries will target the past week's worth of data
- data older than one week will be rarely but consistently queried
- a single query will only touch at most one day's worth of data
Ideally, our Druid solution would work as follows:
One tier of historicals would be responsible for the “current” data (< 1 week old), keep all of their assigned segments in the segment cache, and be able to service queries quickly.
Another tier of historicals would be responsible for "backdated" data (> 1 week old). These historicals would each track a larger slice of segments, and would not keep all of those segments in the segment cache. These older segments would stay compressed in deep storage most of the time, and if a query needs backdated data, the historical would pull it from deep storage to service the query.
Queries that target backdated data would certainly take a large performance hit, but since these are rare, it is acceptable. Additionally, once a backdated segment is loaded, it would stay in the “backdated” segment cache until evicted.
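Roughly, the tier split we have in mind could be expressed with Druid's historical tiers and coordinator retention rules. This is only a sketch: the tier names are our own, and note that this approach still keeps backdated segments assigned to (cheaper) historicals rather than loading them purely on demand, which is exactly the part we are unsure Druid supports.

```python
import json

# Each historical declares its tier in runtime.properties, e.g.:
#   druid.server.tier=hot    (the "current" tier)
#   druid.server.tier=cold   (the "backdated" tier)
# The tier names "hot" and "cold" are assumptions for illustration.

# Coordinator retention rules, evaluated top-down: data newer than one
# week lands on the "hot" tier, everything older on the "cold" tier.
rules = [
    {
        "type": "loadByPeriod",
        "period": "P1W",
        "tieredReplicants": {"hot": 2},
    },
    {
        "type": "loadForever",
        "tieredReplicants": {"cold": 1},
    },
]

# The rule set would be POSTed to the coordinator for a datasource, e.g.
#   POST /druid/coordinator/v1/rules/<datasource>
payload = json.dumps(rules, indent=2)
print(payload)
```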
At first, we thought that Druid supported such workflows, but after digging through the docs, it wasn't clear whether they are possible with Druid's current version. We're concerned about the resources it would take to keep all the backdated segments serviceable when it is unlikely that a query will target them.
Is a workflow like this currently possible, that is, on-demand loading of deep storage segments for querying? And if not, how hard would it be to implement something like this?
Just curious, what’s stopping you from keeping the same segment data in local storage? Is it absolutely important that the segments not be in local storage (of historicals) but in deep storage?
We have a similar use case. The size of the data over time would be in petabytes; however, 99% of the time, queries would target a smaller set of recently received data.
The total cost to store all the data in local storage and run a Druid cluster with hundreds of nodes would be astronomical. We would instead like to keep only the relevant data in local storage, move the rest to cheaper deep storage like S3, and load data from deep storage on demand, only when a time-series query requests it. Appreciate suggestions! Thanks.
I know you could do that with load rules. You can have a drop rule for any data older than a specified period, so that the segments you keep are only for the period/interval you need.
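For example, a rule set along these lines (the one-week period is just an example) keeps only the last week's segments on historicals; everything older is dropped from historicals but remains in deep storage, from which it can be reloaded later:

```python
import json

# Sketch of a retention rule set: load the last week, drop the rest.
# Rules are evaluated top-down; dropped segments stay in deep storage.
rules = [
    {
        "type": "loadByPeriod",
        "period": "P1W",
        "tieredReplicants": {"_default_tier": 2},
    },
    {"type": "dropForever"},
]
print(json.dumps(rules, indent=2))
```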
Now for the load part, this is the only idea I have right now. Not sure if there is native support in Druid for this. I see two cases:
1. If you want older data at a pre-defined time of day, you can set up a simple cron-based service to write the load rules, execute them, and then drop the segments when you no longer need them.
2. If you need to do this at query time, you can write a wrapper service around the brokers: it does the load-rule part exactly as in the previous step, just through the service instead of a cron job.
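A rough sketch of that wrapper idea, using only the coordinator's rules endpoint. The coordinator address and helper names are made up, and a real service would need to merge the temporary rule into the existing rule set (posting a rule list replaces it) and poll for segment availability before forwarding the query:

```python
import json
import urllib.request

# Assumed coordinator address; adjust for your deployment.
COORDINATOR = "http://coordinator:8081"

def load_rule_for(interval: str) -> dict:
    """Build a loadByInterval rule covering the queried interval,
    e.g. interval = "2020-01-01/2020-01-02" (ISO-8601 interval)."""
    return {
        "type": "loadByInterval",
        "interval": interval,
        "tieredReplicants": {"_default_tier": 1},
    }

def request_backdated_interval(datasource: str, interval: str) -> None:
    """POST a temporary load rule to the coordinator (untested sketch).
    A real service would merge this with the datasource's existing
    rules instead of replacing them, then wait for the segments to
    load before passing the query on to the broker."""
    url = f"{COORDINATOR}/druid/coordinator/v1/rules/{datasource}"
    body = json.dumps([load_rule_for(interval)]).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)
```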
Loading segments from deep storage will put some stress on your historicals, so the real-time queries might take a performance hit (there is probably a way to mitigate this; Druid experts would know). Also, for #2 above, a REST call that loads data as a side effect of querying is doing more than querying; not sure that is good REST architecture.
Thanks. #2 is what we expect; however, we think Druid itself would be much more effective here, since it would know which segments to load, in the right order and in a map-reduce fashion, to complete the query. Basically, it could avoid loading a large number of segments, which we would have to do if this were done as a wrapper service.
Sounds like this is not a feature currently available in Druid? Comments from Druid experts and committers would be highly appreciated! Thanks.
I wonder if this is something Druid would like to have as a feature? We wouldn't mind doing the dev work if it fits into the Druid product roadmap. Would be great to hear comments from other Druid committers/maintainers. Thanks.