Configuration to not load all the data

Is there a way to tell Druid not to load all the S3 data…?

I am tired of adding volume to the Historical node, since we are no longer interested in data more than 5 years old.

This is an old question, and since no one replied, I am posting my own answer.

It is in the Druid web console: click on the datasource and then define the retention policy.
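For the case in my original question (keep only the last 5 years on the Historicals), the rules behind the console dialog are just JSON, roughly like this; the tier name and replicant count depend on your cluster:

```json
[
  {
    "type": "loadByPeriod",
    "period": "P5Y",
    "tieredReplicants": { "_default_tier": 2 }
  },
  { "type": "dropForever" }
]
```

Rules are evaluated top to bottom, so anything not matched by the loadByPeriod rule falls through to dropForever.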

Hi,
This tutorial might help you

https://druid.apache.org/docs/latest/tutorials/tutorial-retention.html

Thanks,

–siva

Excuse me if this seems obvious, but IMHO the docs are a bit vague in this regard, so bear with me:

If segments match a time interval with the action "drop", does that mean the segments are dropped only from the segment cache of the Historicals in that specific tier, as opposed to being dropped from deep storage altogether? So deep storage grows indefinitely until an explicit kill task is run?

Regards, Felix

I actually have the same question, but now I assume you are correct, Felix.
I also have a lot of data in my deep storage, but only a couple of months on the Historicals.

So I guess I need to write a cleanup script for the deep storage as well.

Edwin

Actually, I have now read the docs a bit more carefully, and it looks like deep storage can be cleaned up by Druid itself.

I am using Google Cloud Storage, but I guess it should be the same for other deep storage backends.

My settings for the segments:

I now assume this is not enough to delete segments that are no longer available, because GCS still contains older data.
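From what I read, the Coordinator itself can remove unused segments from deep storage if you enable its kill settings; something along these lines in the Coordinator's runtime.properties (example values, I still have to verify this on our cluster):

```properties
# Let the Coordinator periodically issue kill tasks for unused segments (example values)
druid.coordinator.kill.on=true
druid.coordinator.kill.period=P1D
druid.coordinator.kill.durationToRetain=P90D
druid.coordinator.kill.maxSegments=1000
```

If I read it correctly, depending on the Druid version you may also have to allow this per datasource in the Coordinator dynamic config (killDataSourceWhitelist / killAllDataSources).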

Edwin

Edwin,

my understanding from the findings in this thread is that when you drop data via a retention rule, it is deleted from the segment cache only. It stays in deep storage, and there's an option to "re-load data by interval" in the datasources view which you can use to make the Historicals fetch the segments for that particular timeframe again.

After the data has been dropped, you can instruct Druid to delete the segments of a particular datasource that aren't active on any Historical by issuing a kill task (see the screenshot below). I think that might be the option you're looking for.
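If you prefer not to click through the console, the same thing can be done by POSTing a kill task spec to the Overlord at /druid/indexer/v1/task; a minimal example would be something like this (datasource name and interval are placeholders):

```json
{
  "type": "kill",
  "dataSource": "my_datasource",
  "interval": "2010-01-01/2015-01-01"
}
```

As far as I understand, the kill task only removes segments that are already marked as unused (e.g. dropped by your retention rules), so it won't touch anything the Historicals are still serving.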

What I haven't tested is whether Druid immediately drops the segments again after reloading them, in case they still match your drop rule. That would seem like consistent behaviour to me, at least.

[Screenshot: Bildschirmfoto 2020-01-22 um 19.21.49.png]

Hey all,
I’m (a bit) biased, but I believe this blog post can really help to understand Druid’s data retention and deletion - https://medium.com/nmc-techblog/data-retention-and-deletion-in-apache-druid-74ffd12398a8.

I wrote it with a colleague just a few months ago, and we shared some of our interesting findings.

For example, using a correct setup of drop rule(s) and kill task(s) on just one of our data sources, we reduced the amount of used storage from ~365TB to ~15TB and our AWS S3 costs from ~$8.3K/month to ~$350/month.

You can find a few more tips in that post, I hope you’ll find it useful :slight_smile:

Itai

Thanks all. Good discussion after a long time.

Like I said in my earlier post, I did it by defining the "load/drop" retention policy from the Druid console.

Good explanation Itai, thanks.

Sure, glad you liked it :slight_smile: