Automatic deletion of segments from deep storage (aka Kill Rules?)

Good afternoon,

in our team we ran into the problem of running out of free space on HDFS, which led us to a feature we would like to have in Druid.

What we would like is the ability to set up rules not just for dropping segments but also for killing them (i.e. completely removing them from deep storage).

Currently, we see two options here:

  1. either implement it outside Druid using the Kill Task (http://druid.io/docs/latest/ingestion/tasks.html) and schedule its recurring execution (see the sketch after this list),

  2. or implement it within Druid.
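
For illustration, a rough sketch of option 1, assuming a hypothetical data source name, file name and overlord address (the kill task spec itself is described in the tasks documentation linked above):

    {
      "type": "kill",
      "dataSource": "my_datasource",
      "interval": "2016-01-01/2016-02-01"
    }

    # submit the task to the overlord; scheduling this command (e.g. via cron)
    # would give the recurring execution mentioned above
    curl -X POST -H 'Content-Type: application/json' \
         -d @kill_task.json \
         http://overlord-host:8090/druid/indexer/v1/task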

The second option is the reason I decided to bring this question up here. I would like to know your suggestions and general thoughts on it. I'm mainly interested in answers to these questions:

Is there already work in progress on this?

Is there perhaps an existing solution?

If not, where in the code would you recommend starting? And is it feasible to implement something like this at all?

One problem I already see is that, since there are several options for deep storage (local disk, HDFS, S3, Google Cloud Storage), this could turn into quite a large-scale task. Or maybe I'm wrong, and implementing segment deletion for all the storage providers is the least difficult part.

Anyway, I would appreciate any thoughts you have on this topic.

Thanks,

Eugen

Eugen,

This is already implemented. You have to turn it on by setting a
config and then potentially setting the whitelist of data sources you
want it to apply to. You can read more at

http://druid.io/docs/latest/configuration/coordinator.html

Search for the config

druid.coordinator.kill.on
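
For example, a minimal sketch of what this could look like in the
coordinator's runtime.properties (the values here are just
placeholders, not recommendations):

    druid.coordinator.kill.on=true
    druid.coordinator.kill.period=PT86400S
    druid.coordinator.kill.durationToRetain=P90D
    druid.coordinator.kill.maxSegments=100

    # the data source whitelist is set separately, via the coordinator's
    # dynamic configuration (e.g. killDataSourceWhitelist on the page above)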

--Eric

Good morning Eric,

thank you very much for pointing that out.

Best,

Eugen

Hi Eric,

could you please help me with the druid.coordinator.kill.durationToRetain setting? It must be set if druid.coordinator.kill.on is set to true.

I cannot fully understand how it works. Does it apply globally to all data sources? For example, if I had a data source with a drop rule of P1M and durationToRetain set to P2M, when would the segments be killed: two months after they were dropped, or once they are two months old? Or maybe I have completely misunderstood how it works. Another confusing thing is that the default value for the setting is invalid.

Thanks,

Eugen

Good afternoon, colleagues,

my findings after using this configuration:

druid.coordinator.kill.on=true

druid.coordinator.kill.durationToRetain=P30D

druid.coordinator.kill.maxSegments=1

and with killAllSegments set to true in the coordinator's dynamic configuration (http://druid.io/docs/latest/configuration/coordinator.html): the coordinator scheduled a kill task for each data source that is disabled or that has segments outside the boundaries of the specified drop rules. My reading is that the durationToRetain option means the period to retain data beyond the end of the interval covered by the data source's drop rules. However, given the empirical data I have so far (for a data source with a P1Y drop rule and segments more than a year old, a kill task was scheduled that removed a segment spanning the interval 2016-03-08T02:00:00.000Z_2016-03-08T03:00:00.000Z), I cannot conclude that for sure.
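
To make that concrete (under my reading above, which I have not been able to confirm): with a drop rule of P1Y and durationToRetain=P30D, a segment would become eligible for a kill task roughly 30 days after it falls out of the one-year retention window, i.e. once it is about 13 months old. Under the alternative reading, where durationToRetain is counted back from the current time, any already-dropped segment older than 30 days would be eligible.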

Which is good. Now I would like to find out how often the coordinator schedules these kill tasks.

By default it does that once every 24 hours.
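
That interval is controlled by druid.coordinator.kill.period, whose
default of PT86400S corresponds to the 24 hours mentioned above. For
example, to have the coordinator issue kill tasks more often (the value
here is only an illustration):

    druid.coordinator.kill.period=PT21600S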

Hey Everyone,

I have been facing the same issue and have also scheduled the killTask to run periodically. The additional issue I'm facing now is that the Historical deletes the segments from its segment-cache, since the killTask disables the datasource. Is there any way to prevent this with the killTask? Does the issue also occur with dynamic deletion through the coordinator?

Hi Akul,

What behavior are you hoping to see? From what you said, the behavior sounds right - historicals should be clearing out any dropped and/or killed data.

Hey Gian,

Thanks for the quick reply. By clearing the segment-cache on the Historical I meant that all the segments of the datasource are deleted by the Historical. For example, if I have data for a week (2018-03-21/2018-03-28) in the datasource and I fire a killTask for just the first day (2018-03-21/2018-03-22), the killTask succeeds and deletes the data from deep storage as well. But the Historical clears all of the segments and then re-downloads all the leftover segments (2018-03-22/2018-03-28).
This was just an example, but it is creating a problem: I have a one-month window, and my Historical downloads the complete month's data again, which is time-consuming, and the broker doesn't return results until the data has been downloaded by the Historical.