Data deletion in Druid

Hi Guys,

Are there any plans to have more reliable deletion functionality in Druid in upcoming releases? Is this functionality provided by Imply by any chance?

There were a couple of threads about using the ingestSegment firehose and load/drop rules for deleting data, and the limitations around them:

How to selectively delete data from druid

ingestSegment Firehose Doesn’t update segments when all the data needs to be deleted

Reindexing data with IngestSegmentFirehose works first time, fails subsequent times

We need to be able to selectively purge data from Druid, because collected data cannot be kept around once it is no longer required. This seems to be missing in Druid currently. I'm not sure how folks who use Druid in production are dealing with this.

Thanks,

Prathamesh

Hi Gian, Fangjin,

Is this something we may see in Druid going forward?

Thanks,

Prathamesh

Hi Prathamesh,

Druid segments are immutable. It is core to Druid’s architecture and I don’t expect that to change.

However, Druid does support updating/deleting existing data through a process called reindexing, which involves creating new segments out of existing ones.

See: http://druid.io/docs/latest/ingestion/update-existing-data.html

For deleting data in reindexing, you can add a Transformation Spec with a row filter to your Ingestion Spec.

http://druid.io/docs/latest/ingestion/transform-spec.html

Please note that this is probably not relevant for deletes that involve joins. For that case, I would suggest reprocessing the data outside of Druid (with Hadoop, etc.).
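
As a rough sketch of the row-filter approach described above, the snippet below assembles a reindexing task in Python and prints it as JSON. The datasource name "events", the interval, and the helper name build_reindex_spec are made up for illustration. The parser, metricsSpec and granularitySpec are deliberately left out and would need to be copied from your original ingestion spec. The field names follow the ingestSegment firehose and transformSpec docs linked above as I understand them, so please check them against your Druid version.

```python
import json


def build_reindex_spec(datasource, interval, dimension, value_to_delete):
    """Build a reindexing task that drops rows where `dimension` equals `value_to_delete`.

    Hypothetical helper: only the parts relevant to deletion are filled in.
    """
    return {
        "type": "index",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                # parser, metricsSpec and granularitySpec are omitted here;
                # copy them from the spec that originally ingested the data.
                "transformSpec": {
                    # The filter KEEPS matching rows, so the selector is
                    # negated in order to drop the unwanted value.
                    "filter": {
                        "type": "not",
                        "field": {
                            "type": "selector",
                            "dimension": dimension,
                            "value": value_to_delete,
                        },
                    }
                },
            },
            "ioConfig": {
                "type": "index",
                # Read rows back out of the existing segments.
                "firehose": {
                    "type": "ingestSegment",
                    "dataSource": datasource,
                    "interval": interval,
                },
            },
        },
    }


spec = build_reindex_spec("events", "2019-01-01/2019-01-02", "userId", "12345")
print(json.dumps(spec, indent=2))
```

The resulting JSON would then be submitted as an indexing task once the omitted sections are filled in.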

Hi Eyal,

I tried to delete data using reindexing, but found that the ingestSegment firehose cannot reindex and delete data when there is nothing to retain.

Thread: ingestSegment Firehose Doesn’t update segments when all the data needs to be deleted

Thanks,

Prathamesh

Hi Eyal,

When I tried deleting data from Druid using reindexing, I found that the deletion succeeds only when the segment contains some data that is retained after reindexing.

For example, if a segment contains data for userId=“12345” and userId=“45678”, reindexing with a filter that retains only the data for userId=“45678” works fine.

But if the segment contains data only for userId=“12345” and I want to delete the data in this segment, I would run a reindex task with a filter that drops all events for userId=“12345”. However, no new segment is created in this case, because it would be empty: nothing is retained from the old segment. The old segment therefore stays in place.

I think this may be a bug in the ingestSegment firehose, but I'm not sure whether it's by design.

Here’s what David Glasser mentioned about it:

I actually added a warning about this to the docs recently, because it surprised me too: https://github.com/apache/incubator-druid/pull/7046/files#diff-4c1b0fe70b6e47e926987b1c496035b7R130

I think it would be nice to be able to pass a flag to batch ingestion that causes it to fill any gaps inside the interval from your granularitySpec that aren’t covered by new segments with empty new segments, to achieve this goal.

So basically, deletion doesn’t work well when the data to be deleted happens to be the only data in a segment.

Thanks,

Prathamesh
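
To make the edge case concrete, here is a toy model in Python. It is not Druid code; it only mimics the behaviour described in this thread, under the assumption that a reindex publishes a new segment version only when at least one row survives the filter. The segment names and the reindex function are invented for the example.

```python
def reindex(segments, keep_row):
    """Toy model of reindexing with a row filter.

    `segments` maps a segment id to its list of rows. Returns the data that
    would be visible to queries after the reindex task finishes.
    """
    visible = {}
    for seg_id, rows in segments.items():
        kept = [row for row in rows if keep_row(row)]
        if kept:
            # A new segment version is written and replaces the old one.
            visible[seg_id] = kept
        else:
            # Nothing to write, so no new segment is created and the old
            # segment (with the rows we wanted gone) stays queryable.
            visible[seg_id] = rows
    return visible


segments = {
    "seg_mixed": [{"userId": "12345"}, {"userId": "45678"}],
    "seg_single": [{"userId": "12345"}],
}

# Try to delete everything belonging to userId "12345".
after = reindex(segments, lambda row: row["userId"] != "12345")
print(after)
```

In this model the mixed segment ends up holding only userId 45678, while the single-user segment is unchanged, which is the behaviour reported above.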

I see now, thanks.

I think this is really an edge case, where you want to delete data by a dimension (rather than time) but that dimension has a single value for all events in the segment.

I think in that case you could probably work around that by implementing your own logic:

  1. Find the segments that need to be manually deleted. Perhaps you could use a segmentMetadata query with the minmax analysis on the relevant dimension.

  2. Permanently delete the segment by ID: http://druid.io/docs/latest/tutorials/tutorial-delete-data.html

I agree this is not a very clean workaround but I hope this works.
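
The two steps above could be sketched roughly as follows, in Python. This only builds the query body and picks out candidate segments from a response; it does not talk to a cluster. The helper names, the "events" datasource, the coordinator address and the sample response are all made up for illustration. The query fields and the coordinator path are taken from the segmentMetadata docs and the delete-data tutorial as I understand them, so verify them against your Druid version. Also note that comparing minValue and maxValue ignores nulls and multi-value dimensions, which you may need to handle separately.

```python
from urllib.parse import quote


def build_minmax_query(datasource, interval, dimension):
    """Body of a segmentMetadata query asking for min/max of one dimension."""
    return {
        "queryType": "segmentMetadata",
        "dataSource": datasource,
        "intervals": [interval],
        "toInclude": {"type": "list", "columns": [dimension]},
        "analysisTypes": ["minmax"],
        "merge": False,  # keep one result entry per segment
    }


def segments_with_only(results, dimension, value):
    """Return ids of segments whose min and max for `dimension` both equal `value`."""
    ids = []
    for seg in results:
        col = seg.get("columns", {}).get(dimension)
        if col and col.get("minValue") == value and col.get("maxValue") == value:
            ids.append(seg["id"])
    return ids


def disable_segment_url(coordinator, datasource, segment_id):
    """URL to send an HTTP DELETE to, in order to mark one segment as unused."""
    return (
        f"{coordinator}/druid/coordinator/v1/datasources/"
        f"{datasource}/segments/{quote(segment_id, safe='')}"
    )


query = build_minmax_query("events", "2019-01-01/2019-02-01", "userId")

# Hand-written stand-in for a broker response, not real Druid output.
sample_results = [
    {"id": "events_A", "columns": {"userId": {"minValue": "12345", "maxValue": "12345"}}},
    {"id": "events_B", "columns": {"userId": {"minValue": "12345", "maxValue": "45678"}}},
]

to_delete = segments_with_only(sample_results, "userId", "12345")
urls = [disable_segment_url("http://localhost:8081", "events", s) for s in to_delete]
print(to_delete)
print(urls)
```

Marking a segment as unused does not remove it from deep storage; per the tutorial, permanent deletion is a separate step (a kill task over the interval), which would remove every unused segment in that interval.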