Programmatic data deletion

Hi,

To delete specific data, a reindexing task has to be submitted to Druid with filters that describe the rows to remove. This becomes cumbersome when the data must be deleted because the parent entity it was collected for has been deleted (from another database). For example, if a user is deleted from the system, all events ingested into Druid for that user ID need to be deleted, since those events may have been collected specifically for reporting on that user. How can this be done?
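For reference, a minimal sketch of what such a filter-based reindexing task could look like when built from application code. It assumes a recent Druid version with the `druid` input source, day-granularity segments, and no rollup metrics. The function name, the `events` datasource, and the dimension list are hypothetical, and the real spec has to match the datasource's actual schema.

```python
import json


def build_delete_reindex_spec(datasource, interval, user_id, dimensions):
    """Build a reindex task that rewrites `interval` without rows for `user_id`."""
    # Keep only rows whose userId is NOT the deleted user.
    keep_filter = {
        "type": "not",
        "field": {"type": "selector", "dimension": "userId", "value": user_id},
    }
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "timestampSpec": {"column": "__time", "format": "millis"},
                "dimensionsSpec": {"dimensions": list(dimensions)},
                # metricsSpec omitted: assumes a datasource without rollup metrics.
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "DAY",  # assumption
                    "queryGranularity": "NONE",
                    "rollup": False,
                },
                "transformSpec": {"filter": keep_filter},
            },
            "ioConfig": {
                "type": "index_parallel",
                # Read the existing segments of the same datasource back in.
                "inputSource": {
                    "type": "druid",
                    "dataSource": datasource,
                    "interval": interval,
                },
                "appendToExisting": False,
            },
        },
    }


if __name__ == "__main__":
    spec = build_delete_reindex_spec(
        "events", "2019-06-01/2019-06-02", "123", ["userId", "eventType"]
    )
    print(json.dumps(spec, indent=2))
```

The task overwrites the segments in the given interval with a copy that no longer contains rows for the deleted user.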

Thanks,

Prathamesh

Hi Prathamesh:

Druid is not designed for per-row updates. Instead of storing user names in this case, why not store the user ID and create a lookup table that maps IDs to names, which can be modified easily?

https://druid.apache.org/docs/latest/ingestion/update-existing-data.html

Thanks

Hi Prathamesh,

If you need to delete data for compliance reasons, you should batch up your deletes and run a nightly reindexing job. Otherwise, I would suggest you just leave the data in and use a lookup that marks it as ‘hidden’, as Ming suggested.
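A rough sketch of the batching idea, assuming the application queues deleted user IDs during the day and the nightly job turns them into a single filter. The function name is hypothetical. The returned dict is meant to be used as the `filter` of a reindexing task's `transformSpec`.

```python
def build_batch_keep_filter(deleted_user_ids):
    """Return a Druid filter that keeps every row NOT owned by a deleted user."""
    # De-duplicate and sort so the generated spec is stable between runs.
    ids = sorted({str(u) for u in deleted_user_ids})
    if not ids:
        raise ValueError("no user IDs queued for deletion")
    return {
        "type": "not",
        "field": {"type": "in", "dimension": "userId", "values": ids},
    }
```

One reindexing run per night with one `in` filter rewrites each affected interval once, instead of once per deleted user.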

Hi Vadim,

If we never delete data for entities (like users) that are no longer in our system, storage requirements will simply keep growing.

Not sure what you meant by “lookup that marks it as ‘hidden’”. Is this within Druid, or something that needs to be taken care of at the application level? We want the data to be deleted from Druid altogether, i.e. if a user deletes his/her account, we want to delete all the data for this user. Each event ingested into Druid will have userId as one of its dimensions, along with many other dimensions. I understand that Druid doesn’t allow point deletions, but we want to be able to delete the data for this user entirely.

Thanks,

Prathamesh

How can the deletion be triggered from the application instead of submitting re-indexing tasks?

Thanks,

Prathamesh

Currently the only option to modify Druid data is through a re-indexing task.
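That said, the re-indexing task itself can be submitted from application code over HTTP. A minimal sketch, assuming the Overlord is reachable at a URL you supply and that the task spec has already been built as a dict. The helper names are hypothetical. The endpoint used is the Overlord's task submission endpoint, `POST /druid/indexer/v1/task`.

```python
import json
import urllib.request


def build_submit_request(overlord_url, task_spec):
    """Prepare (but do not send) the HTTP request that submits a task."""
    url = overlord_url.rstrip("/") + "/druid/indexer/v1/task"
    body = json.dumps(task_spec).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_task(overlord_url, task_spec, timeout=30):
    """Send the task to the Overlord and return the task ID it reports."""
    req = build_submit_request(overlord_url, task_spec)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        # The Overlord is expected to answer with {"task": "<taskId>"}.
        return json.load(resp)["task"]
```

The application would call `submit_task` when a user is deleted (or from the nightly batch job) and keep the returned task ID to check the outcome later.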

Hi Ming,

Is there any chance of having alternative ways of doing this?

Thanks,

Prathamesh

Hi Ming,

One problem with deleting data through a re-indexing task: if, for a given time interval, the data to be deleted (selected by a filter such as userId=“123”) happens to be the ONLY data in a segment, then re-indexing doesn’t delete it. Since no rows are left to re-index, the segment is not rebuilt, so the old data stays in place. One workaround is to delete the segment itself in this case.

But how does one know whether, for a given interval, the segment contains only data to be deleted and can therefore be safely deleted?
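One possible way to check this before re-indexing, sketched under several assumptions: Druid SQL is enabled, segments use day granularity (so one day bucket corresponds to one time chunk), and the function names and result column names are hypothetical. The idea is to compare, per day, the total row count with the row count for the user. Days where the two are equal hold nothing but that user's data.

```python
def build_count_query(datasource, user_id):
    """Build a Druid SQL API payload counting total vs. per-user rows per day."""
    safe_name = datasource.replace('"', '""')  # escape quotes in the identifier
    sql = (
        "SELECT TIME_FLOOR(__time, 'P1D') AS bucket, "
        "COUNT(*) AS total_rows, "
        "COUNT(*) FILTER (WHERE \"userId\" = ?) AS user_rows "
        f'FROM "{safe_name}" GROUP BY 1'
    )
    # The user ID is passed as a dynamic parameter rather than spliced into SQL.
    return {"query": sql, "parameters": [{"type": "VARCHAR", "value": user_id}]}


def fully_deletable_buckets(rows):
    """Return the buckets in which every row belongs to the deleted user."""
    return [
        r["bucket"]
        for r in rows
        if r["user_rows"] > 0 and r["user_rows"] == r["total_rows"]
    ]
```

The payload would be posted to the SQL endpoint, and the resulting rows passed to `fully_deletable_buckets`. The segments for those buckets are candidates for deleting directly, and the remaining intervals go through the normal re-indexing task. In practice a `WHERE` clause on `__time` would be added to limit the scan.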

Thanks,

Prathamesh

Hi,

Any leads on monitoring the outcome of a re-indexing task, to find out whether nothing was indexed because the segment held only the data to be deleted? In that case the data could be removed by deleting the segment directly.
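One idea, offered as an untested sketch: after the task finishes, fetch its report from the Overlord (`GET /druid/indexer/v1/task/{taskId}/reports`) and look at the row statistics. The layout assumed below (`ingestionStatsAndErrors` → `payload` → `rowStats` → `buildSegments`) is the one documented for simple `index` tasks. It may differ for other task types and Druid versions. It is also an assumption that rows rejected by the filter are counted under `thrownAway`. The function name is hypothetical.

```python
def all_rows_filtered(report):
    """Return True if the task read rows but every one of them was filtered out."""
    payload = report.get("ingestionStatsAndErrors", {}).get("payload", {})
    stats = payload.get("rowStats", {}).get("buildSegments", {})
    processed = stats.get("processed", 0)
    thrown_away = stats.get("thrownAway", 0)
    # Nothing was indexed, yet some rows were seen and discarded.
    return processed == 0 and thrown_away > 0
```

If this returns `True` for an interval, that interval contained only rows matching the delete filter, so its segments could be deleted directly.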

Thanks,

Prathamesh