Deleting segments by drop rules - Confusion about how periods are matched - Druid 0.6.146

I checked the drop rules at http://druid.io/docs/0.6.146/Rule-Configuration.html

Are the drop rules meant for time before the specified period?

If I set the “period” of a period drop rule to P1M, will segments older than 1 month be dropped, or segments within the last month?

I was checking older discussions and found the following at https://groups.google.com/forum/#!msg/druid-development/iUuR7nvkYF0/WNr-y_0wGvUJ:

Segments match the first rule that applies to them.
If you have a period load rule of P1M and a period drop rule of P2M, everything from now to 1 month in the past is loaded, and everything from 1M in the past to 2M in the past is dropped.

Say I want to keep segments for 1 month only; what should my load and drop rules be?

After reading through https://en.wikipedia.org/wiki/ISO_8601#Durations I was thinking of specifying a period such as P1Y for drop and P1M for load. What I mean by this is that, out of the past year, I want to keep only the most recent month of segments and drop the segments for the older 11 months. As no dates are specified, these rules should still be valid one month from now. Is that understanding correct?

Also, does this rule only drop segments from the cluster, or are they deleted from permanent storage as well?

bump

bump

Hi Aseem, if you have
LoadByPeriod P1M

DropByPeriod P2M

in that order:

Segments that overlap with any time from now to 1 month in the past are loaded according to the first rule. Segments that don’t match the first rule and overlap with data going 2 months into the past are dropped. Segments older than that will follow what is dictated by a default rule.
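The first-match behaviour described above can be sketched in a few lines of Python. This is only an illustration of the explanation in this thread, not Druid's implementation: the function names, the fixed `now`, and the 30/60 day approximations for P1M/P2M are all assumptions made for the example, and the overlap test is used for both rules as a simplification (the exact matching test may differ by rule type and Druid version).

```python
from datetime import datetime, timedelta

# Approximate period lengths, for illustration only (real months vary in length).
PERIOD_DAYS = {"P1M": 30, "P2M": 60}

def overlaps(seg_start, seg_end, win_start, win_end):
    """True when the half-open intervals [seg_start, seg_end) and [win_start, win_end) intersect."""
    return seg_start < win_end and win_start < seg_end

def first_matching_rule(rules, seg_start, seg_end, now):
    """Return the type of the first rule whose window overlaps the segment interval."""
    for rule in rules:
        if rule["type"] in ("loadForever", "dropForever"):
            return rule["type"]  # "forever" rules match every segment
        window_start = now - timedelta(days=PERIOD_DAYS[rule["period"]])
        if overlaps(seg_start, seg_end, window_start, now):
            return rule["type"]
    return None  # no datasource rule matched; the default rules would decide

now = datetime(2015, 1, 1)  # hypothetical "current time"
rules = [
    {"type": "loadByPeriod", "period": "P1M"},
    {"type": "dropByPeriod", "period": "P2M"},
]

def day_segment(days_ago):
    """A hypothetical one-day segment that starts `days_ago` days before now."""
    start = now - timedelta(days=days_ago)
    return start, start + timedelta(days=1)

for days_ago in (10, 45, 90):
    print(days_ago, first_matching_rule(rules, *day_segment(days_ago), now))
```

The segment from 10 days ago hits the load rule, the one from 45 days ago falls through to the drop rule, and the one from 90 days ago matches neither, so it is left to the default rule.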

A common pattern is to do something like

LoadByPeriod P1M

DropForever

This means that segments older than 1 month are dropped from the cluster.
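As a rough sketch, this pattern could be written as a rule list in the same JSON shape as the payloads quoted elsewhere in this thread. The helper name `retention_rules` and its default arguments are hypothetical; the tier name and replicant count are assumptions that should be adjusted to your cluster.

```python
import json

def retention_rules(period, tier="_default_tier", replicants=1):
    """Build a load-then-drop rule list: load the most recent `period`, drop everything else."""
    return [
        {"type": "loadByPeriod", "period": period, "tieredReplicants": {tier: replicants}},
        {"type": "dropForever"},
    ]

# Serialize the rules for a one-month retention window.
payload = json.dumps(retention_rules("P1M"))
print(payload)
```

The order matters because the first matching rule wins: the load rule has to come before the catch-all drop rule.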

Are they deleted?

I just want to keep the data for 1 month. Is using loadByPeriod P1M and dropByPeriod P1Y enough? And how can I permanently delete the segments via rules, not only from the cluster but also from deep storage?

Segments are deleted from Druid, but remain in deep storage unless they are explicitly removed from deep storage by a kill task. http://druid.io/docs/latest/misc/tasks.html (search for kill task)

If you want to keep data around for 1M, use loadByPeriod P1M and dropForever as rules.

We rarely use the kill task because in practice we see that clients frequently want to reload their data and change their retention periods.
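For reference, a kill task spec has roughly the shape below, going by the kill task section of the tasks documentation linked above. The helper name, the datasource name, and the interval are placeholders invented for this sketch, and the exact field set should be checked against the docs for your Druid version. Only segments that are already marked unused are eligible to be killed.

```python
import json

def kill_task_spec(data_source, start, end):
    """Build a kill task spec covering the ISO 8601 interval start/end."""
    return {
        "type": "kill",
        "dataSource": data_source,
        "interval": f"{start}/{end}",
    }

# Hypothetical datasource and interval, for illustration only.
spec = kill_task_spec("my_datasource", "2014-01-01", "2014-02-01")
print(json.dumps(spec, indent=2))
```

Because this step removes data from deep storage, it cannot be undone by changing the rules later, which is the reason given above for using it sparingly.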

I am using the Druid HTTP endpoints for specifying the rules. After sending new rules via the API and fetching the rules for the dataSource, I thought that the rules had been overwritten. But Druid is still loading the older data into the cluster. I checked the druid_rules table and there are multiple entries for the datasource’s rules, with different versions. Is this how it is supposed to work?

In the druid_rules table in the database, the following payload is present for an older version

[{"period":"P1M","tieredReplicants":{"hot":0},"type":"loadByPeriod"},{"period":"P10Y","type":"dropByPeriod"}]

and for a newer version

[{"period":"P2D","tieredReplicants":{"_default_tier":0},"type":"loadByPeriod"},{"period":"P10Y","type":"dropByPeriod"}]

But the new rule does not seem to be taking effect, as the old data is still being returned by queries.

Is deleting the rules from the table safe?

Looking at https://github.com/druid-io/druid/blob/druid-0.6.146/server/src/main/java/io/druid/server/http/RulesResource.java it seems that my understanding is correct. But then why is the old data still being loaded into the cluster? I will look further to see if I can find the reason.

Any help would be appreciated.

Hi Aseem, can you please post the output of all rules according to http://druid.io/docs/0.6.146/Coordinator.html?

I’m quite confused looking at the rules you’ve set up as to what you are trying to do.

Hi Fangjin,

I am right now working on dropping the data older than 1 hour from the cluster. I have set the rule as below:

[{"period":"PT1H","tieredReplicants":{"_default_tier":1},"type":"loadByPeriod"},{"type":"dropForever"}]

According to this rule, only the past 1 hour of data should be available to any Druid query. The problem is that I can see the rule setting the ‘used’ field in the segments table to 0, but I can still query the old data.

Can you tell me the possible causes for this kind of behavior?

Your quick reply would be a great help.

Thanks,
Jvalant

The segments should be dropped the next time the coordinator runs. If this is not happening, do you see exceptions in your coordinator logs?

Hi Fangjin,

There is no exception in my coordinator logs. My confusion is as below:

I have configured my datasource with a realtime node and I want to retain only 1 hour of data at any given time. What steps or configuration are needed to get this behavior?

I have set the rule to drop all data older than 1 hour, and according to that rule the segments are being marked as unused. Since my datasource is served by a realtime node, what configuration is needed so that data older than 1 hour is dropped?

Your suggestions would be so helpful.

Thanks,
Jvalant

Jvalant, there is no way to drop data from a realtime node until it hands data off. Everything I mentioned is for historical nodes. What exactly are you trying to do?

Hi,

We are also facing a similar issue: our Historical nodes are running out of disk space, and we want to set a 1 month retention for our segments.

We tried to update the datasource rule as per the docs, but it’s not updating:
http://druid.io/docs/latest/operations/rule-configuration.html
http://druid.io/docs/latest/design/coordinator.html

vikasrana@dr-crd1:~$ curl -XPOST http://dr-crd2.abc.com:8081/druid/coordinator/v1/rules/cerberus_analytics -d '[{"period":"P1M","tieredReplicants":{"_default_tier":0},"type":"loadByPeriod"},{"period":"P10Y","type":"dropByPeriod"}]'

vikasrana@dr-crd1:~$ curl http://dr-crd1.abc.com:8081/druid/coordinator/v1/rules/cerberus_analytics

I also checked the rule table in MySQL; it has the _default rule:

Hi Vikas, what version of Druid are you using?

Have you tried updating the rules from the coordinator UI?

We are using the latest Druid version, 0.9.1.1, and it seems to have worked from the UI.
vikasrana@dr-crd1:~$ curl http://localhost:8081/druid/coordinator/v1/rules/cerberus_analytics

[{"period":"P20D","tieredReplicants":{"_default_tier":1},"type":"loadByPeriod"},{"period":"P10Y","type":"dropByPeriod"}]

vikasrana@dr-crd1:~$ curl http://localhost:8081/druid/coordinator/v1/rules/event_sdk

[{"period":"P1M","tieredReplicants":{"_default_tier":1},"type":"loadByPeriod"},{"period":"P10Y","type":"dropByPeriod"}]

It is also dropping segments from the historical nodes and setting used=0 in MySQL, while all the data is still present in S3.

From the coordinator logs:

2016-08-26T05:56:59,293 INFO [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorBalancer - [_default_tier]: Segments Moved: [1]

2016-08-26T05:56:59,293 INFO [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorLogger - [_default_tier] : Assigned 0 segments among 4 servers

2016-08-26T05:56:59,294 INFO [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorLogger - [_default_tier] : Dropped 11 segments among 4 servers

2016-08-26T05:56:59,294 INFO [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorLogger - [_default_tier] : Moved 1 segment(s)

Thanks for the quick help and guidance. Really appreciated, cheers !!

Vikas

Hi Vikas,

When segments are marked as used=0 in the metadata storage, they will be dropped from the Druid cluster. To completely remove a segment, you must use the kill task:

http://druid.io/docs/0.9.1.1/operations/rule-configuration.html

By completely removing a segment, I mean wiping the metadata entry and also removing the segment from deep storage.

Hi,

We have a daily batch ingestion task which loads the data. Over time we accumulated a lot of segments on disk, so we set up load and drop rules for 5D.

We thought this would delete all the segments “created” more than 5 days ago, but it deleted everything to date. Do the rules not work on segment creation date?

Our segment granularity is by month, so it kept all the segments for the current month but deleted everything else.

I’m confused about how the rules work. We had a kill task to kill the segments using the coordinator. What does durationToRetain do in this case?

Please help

The rules apply to the date range for the segment, not the time it was created. So if you have a drop rule for 5D and create a segment with a time range of 7D ago, it will be dropped immediately.
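To make this concrete for month-granularity segments, here is a small sketch of which monthly segments a 5-day load window would keep. The dates are made up for the example, and the overlap test is a simplification of whatever Druid does internally; the point is only that the segment's time range is what gets compared, no matter when the batch job wrote the segment.

```python
from datetime import datetime, timedelta

def overlaps(a_start, a_end, b_start, b_end):
    """True when the half-open intervals [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end

now = datetime(2016, 9, 20)              # hypothetical "current time"
window_start = now - timedelta(days=5)   # a P5D window covers [Sep 15, Sep 20)

# Monthly segments keyed by month; each value is the segment's data interval.
segments = {
    "2016-08": (datetime(2016, 8, 1), datetime(2016, 9, 1)),
    "2016-09": (datetime(2016, 9, 1), datetime(2016, 10, 1)),
}

# A segment is kept only if its data interval touches the 5-day window.
kept = {
    month: overlaps(start, end, window_start, now)
    for month, (start, end) in segments.items()
}
print(kept)
```

Only the current month's segment touches the window, which matches the behaviour reported above: the current month is kept and every earlier month is dropped, even if those segments were created yesterday.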