How to selectively delete data from Druid

Hi Swapneel,

So if I understood you correctly: you originally have 10 segments created through ingestion with the kafka-indexing-service, and you re-index using the ingestSegment firehose with a filter to drop some data (let's say DROP where dimensionX = “abc”). Then:

  • 2 segments are dropped (assuming these 2 segments are the ones containing data where dimensionX = “abc” )

  • You end up with 8 segments

  • If you then try deleting more data by re-indexing with the ingestSegment firehose, expecting 3 more segments to be dropped, you still see 8 segments after re-indexing (there should have been only 5)

Is that correct?

Thanks,

Prathamesh

Hi Prathamesh,

I haven’t checked this in terms of segments, only in terms of raw events.

  • Say 150 events are present with 3 distinct IDs

  • When I drop events based on NOT ID1, I can see the count decreasing from 150 to 100.

  • But when I drop events based on NOT ID2 (for the same time intervals), the count remains 100, with 0 events processed.
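For reference, the counts one would expect if both filters took effect can be worked out with a small simulation. This is plain Python, not Druid code, and the even 50/50/50 split across the three IDs is an assumption (the numbers above only imply that ID1 accounts for 50 events).

```python
# Hypothetical data: 150 events spread evenly over ID1, ID2 and ID3.
events = [{"id": f"ID{i % 3 + 1}"} for i in range(150)]


def keep_not(rows, dimension, value):
    """Keep only rows whose dimension does NOT equal value (a 'not' filter)."""
    return [r for r in rows if r[dimension] != value]


after_first = keep_not(events, "id", "ID1")        # re-index with NOT ID1
after_second = keep_not(after_first, "id", "ID2")  # re-index again with NOT ID2

print(len(events), len(after_first), len(after_second))  # 150 100 50
```

So a second pass that leaves the count at 100 means the NOT ID2 filter never saw any matching rows.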

Hi Swapneel,

Sounds similar to what I am doing, except that I have never been able to delete data using ingestSegment. One difference is that I ingest data in real time using the Tranquility Core library.

Interesting that you have a hunch this may be due to the ingestion method (kafka-indexing-service, native ingestion task, or real-time ingestion using Tranquility). I am not sure why it would be designed that way. Once the data is ingested, it is stored in segments, and I don’t think segments generated by native ingestion are any different from those generated by other ingestion methods.

Are you sure the events you want to delete for NOT ID2 are in the same time interval as the events for NOT ID1? Can you confirm this?

You can try one more thing:

  1. Ingest 150 events with 3 distinct IDs.

  2. Drop events based on NOT ID1 and check whether the count decreases.

  3. Drop events based on NOT ID2 (for the same time interval, or a different one if these events fall in a different interval) and check whether the count decreases.

  4. As you said, step 3 may not work. Try changing the destination datasource to something else and check whether the events for NOT ID2 are created in the destination datasource. So:

{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "your_datasource_here"
      ...
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "ingestSegment",
        "dataSource": "temporary_datasource",
        ...
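As a rough sketch of how such a task could be put together, the snippet below builds the same kind of spec with a `not`/`selector` filter on the firehose and a separate destination datasource. The datasource names, interval, dimension and value are all placeholders, and the `dataSchema` is deliberately incomplete, so treat it as a starting point and not a ready-to-submit task.

```python
import json


def build_reindex_task(source_ds, dest_ds, interval, dimension, value):
    """Build a partial native index task that re-reads existing segments
    and keeps only rows where `dimension` != `value`."""
    return {
        "type": "index",
        "spec": {
            "dataSchema": {
                # Destination datasource (where the filtered rows are written).
                "dataSource": dest_ds,
                # parser, metricsSpec and granularitySpec are omitted here;
                # a real task needs them to match the source datasource.
            },
            "ioConfig": {
                "type": "index",
                "firehose": {
                    "type": "ingestSegment",
                    # Source datasource (whose segments are re-read).
                    "dataSource": source_ds,
                    "interval": interval,
                    "filter": {
                        "type": "not",
                        "field": {
                            "type": "selector",
                            "dimension": dimension,
                            "value": value,
                        },
                    },
                },
            },
        },
    }


# Placeholder names: "events" (source) and "events_check" (destination).
task = build_reindex_task("events", "events_check",
                          "2018-01-01/2018-01-02", "id", "ID2")
print(json.dumps(task, indent=2))
```

Writing to a different destination makes it easy to query the new datasource and see whether any rows came through at all.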

**If you are seeing “0 events processed”, I think it is most likely because you don’t have events in the specified time period.**

Thanks,

Prathamesh

Hi Prathamesh, you are right. It indeed is an issue with my filters. I have documented the cause here.

One can selectively delete data from Druid by re-indexing with ingestSegment only when the segment holding the data to be deleted also holds other data that is retained after re-indexing. When the data to be deleted is the only data in a segment, that segment is left unchanged and the data is not deleted.

This is apparent from this thread as well as discussion at: Reindexing data with IngestSegmentFirehose works first time, fails subsequent times

This means ingestSegment is not a reliable way to delete data from Druid. There should be some other mechanism in Druid that allows data to be deleted.
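The limitation described above (a segment whose rows are all filtered out is left untouched) can be pictured with a toy model. This is not Druid code. The mechanism it assumes, that a re-index publishes a new segment version only for intervals where at least one row survives the filter, is one reading of this thread and not something confirmed here; the interval names and rows are made up.

```python
def reindex(segments, keep):
    """Toy model: `segments` maps interval -> (version, rows).
    A new version is published only for intervals with surviving rows."""
    result = dict(segments)
    for interval, (version, rows) in segments.items():
        kept = [r for r in rows if keep(r)]
        if kept:  # nothing is written when every row is filtered out
            result[interval] = (version + 1, kept)
    return result


segments = {
    "day1": (1, ["ID1", "ID2"]),  # mixed segment: ID2 can be removed
    "day2": (1, ["ID2"]),         # only ID2: nothing survives the filter
}
after = reindex(segments, lambda r: r != "ID2")
print(after)  # {'day1': (2, ['ID1']), 'day2': (1, ['ID2'])}
```

In this model the "day2" segment keeps its old version and its ID2 rows, which matches the symptom of the count not changing.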

Thanks,

Prathamesh

So how can data be deleted in Druid without using load/drop rules or the ingestSegment firehose?

Limitations of each:

Load/drop rules - Cannot be used to selectively delete data, as in ***delete from datasource where dimensionA=“x”***
Re-index using ingestSegment - Works only when the segment that contains the data to be deleted also has other data that needs to be retained after re-indexing

Thanks,

Prathamesh