How to selectively delete data from druid

Hi,

I would like to delete specific data from druid based on certain filters.

For example: for a given time interval, delete all the events for which dimension A = "xyz" and dimension B = "pqrs".

How can this be achieved?

Thanks,

Prathamesh

Prathamesh,

IMO, you can only overwrite the segment with the data you want to keep.

http://druid.io/docs/latest/tutorials/tutorial-update-data.html

Thanks & Rgds

Venkat

Hi Venkat,

I don’t want to update the existing data. I simply want to delete some of the existing data from druid.

For example: DELETE FROM movies WHERE released_year = '2012'

How do I do something like the above in druid? I know druid works best when the use case doesn't involve frequent updates/upserts to data, but it should provide a facility to selectively delete data.

Thanks,

Prathamesh

Hi Prathamesh,

The way I am doing this is: first, use a time boundary query to figure out the intervals where your data lies. Then, using those intervals, post an indexing task with an ingestSegment firehose in the ioConfig to reindex that data. While reindexing, you can apply a filter to NOT include whatever data you want to delete.
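For reference, a minimal sketch of the time boundary query body for the first step (the datasource name is a placeholder; the body would be POSTed to the broker's query endpoint):

```python
import json

# Time boundary query: returns the min/max timestamps present in the
# datasource, which give you the intervals to reindex over.
# "your_datasource_here" is a placeholder name.
time_boundary_query = {
    "queryType": "timeBoundary",
    "dataSource": "your_datasource_here",
}

print(json.dumps(time_boundary_query, indent=2))
```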

Hi Swapneel,

This sounds like something that could help me with my requirement. I’ll try this.

Thanks a lot!

Hi Swapneel,

I was able to get the intervals using the time boundary query. However, I am not able to run the indexing task at http://localhost:8090/druid/indexer/v1/task. Do you have a sample JSON that you use to delete data via reindexing? I am looking for the body that needs to go with the POST request to http://localhost:8090/druid/indexer/v1/task.

Thanks,

Prathamesh

### Reindex the data without the specified data
POST http://localhost:8090/druid/indexer/v1/task
Content-Type: application/json

{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "your_datasource_here",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {
            "column": "timestamp_column",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2019-02-21T10:50:38.816Z/2019-02-21T11:08:11.331Z"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "ingestSegment",
        "dataSource": "your_datasource_here",
        "interval": "2019-02-21T10:50:38.816Z/2019-02-21T11:08:11.331Z",
        "filter": {
          "type": "not",
          "field": {
            "type": "and",
            "fields": [
              {
                "type": "selector",
                "dimension": "fieldA",
                "value": "abc"
              },
              {
                "type": "selector",
                "dimension": "fieldB",
                "value": "def"
              },
              {
                "type": "selector",
                "dimension": "fieldC",
                "value": "xyz"
              }
            ]
          }
        }
      }
    }
  }
}
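To sanity-check what that filter keeps, here is a small Python sketch (not Druid code, just the same boolean logic) applying NOT(AND(fieldA="abc", fieldB="def", fieldC="xyz")) to sample rows: a row is dropped only when all three selectors match.

```python
# Rows are kept when NOT (fieldA == "abc" AND fieldB == "def" AND fieldC == "xyz").
# This mirrors the "not"/"and"/"selector" filter nesting in the ioConfig above.
def keep(row):
    drop = (row.get("fieldA") == "abc"
            and row.get("fieldB") == "def"
            and row.get("fieldC") == "xyz")
    return not drop

rows = [
    {"fieldA": "abc", "fieldB": "def", "fieldC": "xyz"},    # matches all three -> dropped
    {"fieldA": "abc", "fieldB": "def", "fieldC": "other"},  # one selector fails -> kept
]
kept = [r for r in rows if keep(r)]
print(len(kept))  # 1
```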

Hi Swapneel,

Thanks for your help. I tried the JSON sample you shared and was able to run a task to reindex the data. I could see the task succeeding. However, the data I am trying to delete is still there.

Can ingestSegment be used to reindex data into the same datasource? I mean, in the JSON the datasource for both ioConfig and dataSchema is the same (it should be, since we want to delete data from a given datasource and not move it elsewhere). As per the documentation, the ingestSegment firehose reads from existing segments and reindexes the data per the specifications mentioned in the index task.

Not sure if I am missing something. Even after the reindex tasks succeed, the data still seems to be there (as can be seen by firing a groupBy query). Does a separate task need to be run to purge the data after reindexing?

Thanks,

Prathamesh

Can you share the logs of the reindex task?

Hi Swapneel,

Please find attached the logs for re-index task.

I have the following data in my druid setup:

[
    {
        "version": "v1",
        "timestamp": "2019-02-25T08:15:00.000Z",
        "event": {
            "count": 5,
            "os": "android",
            "user": "haddock"
        }
    },
    {
        "version": "v1",
        "timestamp": "2019-02-25T08:15:00.000Z",
        "event": {
            "count": 5,
            "os": "android",
            "user": "tintin"
        }
    },
    {
        "version": "v1",
        "timestamp": "2019-02-25T08:30:00.000Z",
        "event": {
            "count": 5,
            "os": "android",
            "user": "tintin"
        }
    }
]

I want to delete all the events where user="tintin", which are spread across 2 segments.
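For this case, the filter in the ingestSegment firehose would just invert a single selector on the user dimension. A sketch, with the same predicate applied locally to the three sample events above to preview the outcome (only the "haddock" row should survive the reindex):

```python
# Reindex filter to drop user == "tintin": keep rows where NOT(user == "tintin").
# This is the "filter" object that would go inside the ingestSegment firehose.
delete_filter = {
    "type": "not",
    "field": {"type": "selector", "dimension": "user", "value": "tintin"},
}

events = [
    {"timestamp": "2019-02-25T08:15:00.000Z", "os": "android", "user": "haddock"},
    {"timestamp": "2019-02-25T08:15:00.000Z", "os": "android", "user": "tintin"},
    {"timestamp": "2019-02-25T08:30:00.000Z", "os": "android", "user": "tintin"},
]

# Apply the same NOT(selector) predicate locally to preview the result.
surviving = [e for e in events
             if not e["user"] == delete_filter["field"]["value"]]
print([e["user"] for e in surviving])  # ['haddock']
```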

Please find attached the logs for the reindex task (log.txt), the tranquility config (tranquilityConfig.json) which my application uses to ingest data into druid, the response of the time boundary query (timeboud_query.txt), and segment details captured from the segments listed at localhost:8081/#/datasources/sampledatasouce (druid_datasouce_segments.txt).

After running the reindex task I can still see data for user="tintin".

Not sure what I am missing. Let me know if you spot something.

Thanks,

Prathamesh

druid_datasouce_segments.txt (1.82 KB)

log.txt (104 KB)

timeboud_query.txt (362 Bytes)

tranquilityConfig.json (1.69 KB)

Hi Prathamesh,

The logs look strange: the task status is emitted as success, whereas the logs show the task failed at the end. Can you confirm the historical nodes aren't down? https://github.com/apache/incubator-druid/issues/3851

Hi Swapneel,

That is indeed very strange. I didn't notice this until you pointed it out. The task status was SUCCESS in the overlord console as well. One thing to mention here: I am using a single-node setup (for trying out this reindex functionality), so all my nodes run on the same machine (druid quickstart guide). I ingest the data in real time using tranquility. After the duration of the real-time task, the data can still be queried, which means the handoff of segments to the historical node is successful, so there shouldn't be any problem with the historical node.

Thanks,

Prathamesh

Hi,

Found something even stranger.

If you were talking about:
2019-02-26T05:46:14,854 INFO [main] org.apache.druid.indexing.overlord.TaskRunnerUtils - Task index_realtime_XXXXXXXX_2019-02-26T05:15:00.000Z_0_0] status changed to [FAILED].

This error exists in the logs of all successful tasks, not just reindex tasks but real-time tasks as well. I have been using real-time tasks for a while now and I have never seen any issue with the data ingested by these tasks. I am thinking that reindexing not working may not be due to the task [FAILED] error.

Thanks,

Prathamesh

Hi Swapneel,

Can we actually use ingestSegment to delete data?

What I did was: I used the reindex task to read from my datasource_1 (specified under ioConfig) and reindex while dropping certain rows based on a filter, but I set the datasource under dataSchema to a new datasource, datasource_2. I found that datasource_2 had only the events I wanted, i.e. all the rows I didn't want were dropped successfully as specified by the filter. So this worked fine. I could still see the task status changed to FAILED in the logs after emitting SUCCESS, but clearly that doesn't matter.

Setting datasource to a new datasource:

{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "datasource_2",
      "parser": {
        "type": "string",


Reading from existing datasource:

    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "ingestSegment",
        "dataSource": "datasource_1",


Were you able to use ingestSegment to drop data from a datasource, i.e. reindex into the same datasource and drop certain rows based on a filter?

It looks like ingestSegment doesn't work when the source and destination datasources are the same.

Thanks,

Prathamesh

Hi Prathamesh,

Reindexing seems to work for me within the same datasource. I too am using a single-node druid cluster, but I'm using Docker Compose to run the nodes.

Hi Prathamesh,

You will need to reindex all the segments with a "not" filter. Below was my ioConfig for reindexing:
"ioConfig": {
  "type": "hadoop",
  "inputSpec": {
    "type": "dataSource",
    "ingestionSpec": {
      "dataSource": "cust",
      "intervals": ["2013-01-01/2019-01-01"],
      "filter": { "type": "not", "field": { "type": "selector", "dimension": "id", "value": "1234" } }
    }
  }
},

Hi Swapneel, Durgadas,

I am able to run the tasks with the configuration suggested for reindexing, but I am not able to see the rows actually being dropped.

I also checked the task report which looks like:
{
  "ingestionStatsAndErrors": {
    "taskId": "index_sampledatasouce_2019-02-27T07:22:30.605Z",
    "payload": {
      "ingestionState": "COMPLETED",
      "unparseableEvents": {},
      "rowStats": {
        "determinePartitions": {
          "processed": 0,
          "processedWithError": 0,
          "thrownAway": 0,
          "unparseable": 0
        },
        "buildSegments": {
          "processed": 15,
          "processedWithError": 0,
          "thrownAway": 0,
          "unparseable": 0
        }
      },
      "errorMsg": null
    },
    "type": "ingestionStatsAndErrors"
  }
}

The line "processed": 15 indicates that the correct number of records is found when reading from the datasource (i.e., not including the records I want to drop).

I also checked the logs for the historical node. They show an attempt to actually delete the segments that contain the events I want to drop. However, after the task completes I still see the segments present in the segment-cache. (I have attached the logs from my historicals indicating the attempt to delete the segment-cache for segments that are not selected.)

Do you guys see any completion status/log in your historicals indicating that segment deletion was successful?

Thanks,

Prathamesh

segmentdroologs.txt (2.12 KB)

Hi Fangjin, Gian,

Data deletion seems complicated in druid. Can you help with this query?

Thanks,

Prathamesh

Hi,

Apart from re-indexing (which doesn't seem to work for the same datasource) and load/drop rules, is there any other way to delete data from druid?

I am not able to achieve a very simple thing in druid, i.e. the druid equivalent of:

DELETE FROM datasource WHERE dimensionX = 'A'
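For comparison, the closest druid analogue of that SQL seems to be a reindex whose filter inverts the WHERE clause. A sketch of that mapping (the datasource, dimension, and helper names here are hypothetical, just to illustrate the translation):

```python
import json

# SQL:   DELETE FROM datasource WHERE dimensionX = 'A'
# Druid: reindex the affected intervals, keeping NOT(dimensionX = 'A').
def delete_as_reindex_filter(dimension, value):
    """Build the ingestSegment filter that drops rows matching dimension=value."""
    return {
        "type": "not",
        "field": {"type": "selector", "dimension": dimension, "value": value},
    }

spec_fragment = delete_as_reindex_filter("dimensionX", "A")
print(json.dumps(spec_fragment, indent=2))
```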

Thanks,

Prathamesh

Hi Prathamesh,
Even I'm facing the same roadblock as you are. Reindexing segments does work for deleting the said data for me, but with a very weird observation: after the data has been indexed once for a certain time interval, any subsequent indexing task I perform for that interval does not work, with logs similar to yours (0 events processed). I surmise this may have something to do with the way the data was originally indexed: if it was ingested by the kafka-indexing-service, then updating the segments by reindexing works; however, if it was ingested by a native batch indexing job, it doesn't. I am yet to ascertain the cause of this cryptic behaviour.