Purging data selectively from Druid

Hi,

I am pretty new to Druid and have the task of selectively purging data from deep storage.

We have a datasource which stores data for multiple clients. We need to delete the data for one particular client only.

I saw in the Druid deleting-data tutorial that we can disable segments and delete them. But in our case, doing that would delete data for other customers as well.

In short, what we want is the Druid equivalent of "DELETE FROM CUSTOMER WHERE CUSTOMER_NAME='XXX'".

Can you please guide me through the procedure for achieving this?


Hey Durga,
You need to set a drop rule in the coordinator console; you can view the coordinator console on port 8081 of the machine.
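
For reference, retention rules for a datasource are a JSON array evaluated top to bottom; a rough sketch of a drop rule (the interval is a placeholder, and the trailing load rule keeps everything else loaded) might look like:

[
  {
    "type": "dropByInterval",
    "interval": "2015-01-01/2016-01-01"
  },
  {
    "type": "loadForever"
  }
]

These can be edited in the coordinator console or, assuming a default setup, POSTed to http://<coordinator_ip>:8081/druid/coordinator/v1/rules/<datasource>.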

Thanks Kiran. I considered that option, but creating a drop rule will drop whole segments (which contain data from multiple clients) based on intervals, while we want to selectively delete all data to date for one particular customer.

Hey Durgadas,

An option would be to reingest the segment but include a filter that only selects the data you want to keep.

If you’re using Hadoop batch ingestion, you can specify an inputSpec of type dataSource which can take a filter. There’s some more information here: http://druid.io/docs/latest/ingestion/update-existing-data.html

Here’s a rough example:

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "dataSource",
        "ingestionSpec": {
          "dataSource": "{replace_me}",
          "intervals": [
            "{replace_me}"
          ],
          "filter": {
            "type": "not",
            "field": {
              "type": "selector",
              "dimension": "client_id",
              "value": "drop_this_client"
            }
          }
        }
      }
    },
    "dataSchema": {
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
      },
      "jobProperties": {
      }
    }
  }
}

Note that you can do the same thing with native batch ingestion (new in 0.13) if you are not set up for Hadoop.
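
For reference, a rough sketch of what the native batch equivalent might look like, using the ingestSegment firehose (the placeholders and the empty dataSchema should be filled in the same way as the Hadoop example above):

{
  "type": "index",
  "spec": {
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "ingestSegment",
        "dataSource": "{replace_me}",
        "interval": "{replace_me}",
        "filter": {
          "type": "not",
          "field": {
            "type": "selector",
            "dimension": "client_id",
            "value": "drop_this_client"
          }
        }
      }
    },
    "dataSchema": {
    },
    "tuningConfig": {
      "type": "index"
    }
  }
}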

Thanks Dylan and David for the response.
I am using Tranquility Kafka for ingestion.

The approach I am trying for purging is:

  1. Index new segments with the customer id filtered out, by submitting the spec below.

  2. Then disable the segments in #1 and delete them (see the kill task sketch below).
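
For step #2, my understanding is that once the segments are disabled (marked unused), a kill task along these lines, submitted to the overlord, would remove them from deep storage and the metadata store (dataSource and interval are placeholders):

{
  "type": "kill",
  "dataSource": "{replace_me}",
  "interval": "{replace_me}"
}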

For Step #1 :

I tried to post the spec below to the overlord but am facing an error:

curl -X 'POST' -H 'Content-Type:application/json' -d @deletion-index.json http://<Overlord_ip>:8080/druid/indexer/v1/task

error :

{"error":"Instantiation of [simple type, class io.druid.indexing.common.task.IndexTask] value failed: Optional.get() cannot be called on an absent value"}

The version of Druid is 0.9. Can you please let me know if that is the correct approach and how I can get past this error?

The approach I am trying for purging is:

  1. Index new segments with the customer id filtered out, by submitting the spec below.

  2. Then disable the segments in #1 and delete them.

If you wanted to remove rows where custId is 123, the reindexing job should filter on custId != 123 instead (i.e., preserve all the rows that do not belong to customer 123). The new segments (without cust 123) will replace the old set (with cust 123), so step #2 is not necessary.

error: {"error":"Instantiation of [simple type, class io.druid.indexing.common.task.IndexTask] value failed: Optional.get() cannot be called on an absent value"}

The version of Druid is 0.9. Can you please let me know if that is the correct approach and how I can get past this error?

I would check the overlord logs and look for the full stack trace for that error; it would give you more information on which field it's trying to access. Something is missing or malformed in the ingestion spec.

It also looks like you’re basing the ingestion specs on examples from the new tutorials that were added around Druid 0.12.3. I would recommend using the current 0.13.0-incubating, since these example specs are written for newer versions of Druid (for example, transformSpec does not exist in older versions).

Thanks a lot. I went through the logs to find the error in the spec and successfully posted the reindexing task with Hadoop batch ingestion. Below is the spec for it. But when I run the select query again (for the purged id), I still see the records, so it seems the data is not actually deleted. The segment ids returned by the select query have the "used" column set to 1 in the "segments" table in MySQL.

Select Query :

{
  "queryType": "select",
  "dataSource": "customer",
  "intervals": ["2015-12-20/2018-12-20"],
  "descending": "false",
  "dimensions": [],
  "metrics": [],
  "granularity": "ALL",
  "filter": { "type": "selector", "dimension": "id", "value": "123" },
  "pagingSpec": { "pagingIdentifiers": {}, "threshold": 1000 }
}


Spec :

{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "customer",
      "parser": {
        "type": "string",
        "parseSpec": {
          "timestampSpec": { "column": "timestamp", "format": "millis" },
          "dimensionsSpec": {
            "dimensions": ["id", "dim2", "dim3", "dim4"]
          },
          "format": "json"
        }
      },
      "metricsSpec": [],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "hour",
        "intervals": ["2013-01-01/2019-01-01"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "dataSource",
        "ingestionSpec": {
          "dataSource": "customer",
          "intervals": ["2013-01-01/2019-01-01"]
        },
        "filter": { "type": "not", "field": { "type": "selector", "dimension": "id", "value": "123" } }
      }
    },
    "tuningConfig": {
      "type": "hadoop"
    }
  }
}


I tried changing the input spec as below, but it did not help. Can someone please help?
"inputSpec": {
  "type": "dataSource",
  "ingestionSpec": {
    "dataSource": "customer",
    "intervals": ["2013-01-01/2019-01-01"]
  },
  "filter": { "type": "not", "field": { "type": "selector", "dimension": "id", "value": "123" } }
}


I mean like below:
"inputSpec": {
  "type": "dataSource",
  "ingestionSpec": {
    "dataSource": "customer",
    "intervals": ["2013-01-01/2019-01-01"],
    "filter": { "type": "not", "field": { "type": "selector", "dimension": "id", "value": "123" } }
  }
}

Hi Durgadas,

Were you able to figure this out? I am trying to do something similar, although I am not using Hadoop. I am using ingestSegment to read from existing segments and trying to selectively delete data based on certain filters.

Thanks,

Prathamesh

Yes. You will need to reindex all the segments with a "not" filter. Below is my ioConfig:
"ioConfig": {
  "type": "hadoop",
  "inputSpec": {
    "type": "dataSource",
    "ingestionSpec": {
      "dataSource": "cust",
      "intervals": ["2013-01-01/2019-01-01"],
      "filter": { "type": "not", "field": { "type": "selector", "dimension": "id", "value": "1234" } }
    }
  }
},