Deleting data for a specific dimension value

I need to delete all data that has a specific dimension value (e.g. “id”=“application_01”).

I followed the instructions in your tutorial.

I understood that with Druid you can only delete data per segment. Therefore, my plan was to run a compaction task using a partitionsSpec on my dimension “id”, like this:

{
  "type": "compact",
  "dataSource": "my-data-source",
  "tuningConfig": {
    "type": "index_parallel",
    "maxRowsInMemory": 100000,
    "partitionsSpec": {
      "type": "single_dim",
      "targetRowsPerSegment": 250000,
      "partitionDimension": "id",
      "assumeGrouped": true
    }
  }
}

This re-partitioned my data, but when I queried the segment metadata with:

{
  "queryType": "segmentMetadata",
  "dataSource": "my-data-source",
  "toInclude": { "type": "list", "columns": ["id"] },
  "intervals": ["2021-01-01/2021-03-01"]
}

I realised that my segments still contain different values for the dimension “id”.

That means deleting these segments would also delete data that I want to keep.
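
For reference, the deletion step I had in mind afterwards was a kill task over the interval of the unwanted segments, roughly like this sketch (the interval is just a placeholder, and the segments would first have to be marked as unused):

{
  "type": "kill",
  "dataSource": "my-data-source",
  "interval": "2021-01-01/2021-03-01"
}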

Please tell me: is there a partitionsSpec that partitions segments by distinct values of a dimension, or is there generally an easier solution to my problem?

Relates to Apache Druid 0.20.1

I think I am interpreting a solution posed by @Muthu_Lalapet correctly here, but I haven’t tried it myself.

I think it can be done by reindexing the data and using filters in the transformSpec. Please see the link below for more details:

Ingestion · Apache Druid - Please see the filter section
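
As a rough sketch of the idea (the dimension and value are just taken from your example), the transformSpec of the reindexing task would carry a “not” filter around a “selector” filter:

{
  "transformSpec": {
    "filter": {
      "type": "not",
      "field": {
        "type": "selector",
        "dimension": "id",
        "value": "application_01"
      }
    }
  }
}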

Hi Rachel, thank you very much for your reply. I think your suggested solution should work for me. Unfortunately, I ran into another issue.

I created a reindexing spec using a “not” filter:

{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "druid",
        "interval": "1970-01-01/2020-02-01",
        "dataSource": "my-data-source"
      }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "partitionsSpec": {
        "type": "dynamic"
      },
      "maxRowsInMemory": 100000
    },
    "dataSchema": {
      "dataSource": "my-data-source",
      "granularitySpec": {
        "type": "uniform",
        "queryGranularity": "HOUR"
      },
      "timestampSpec": {
        "column": "!!!_no_such_column_!!!",
        "missingValue": "1970-01-01T00:00:00Z"
      },
      "dimensionsSpec": {},
      "transformSpec": {
        "filter": {
          "type": "not",
          "field": {
            "type": "selector",
            "value": "application_01",
            "dimension": "id"
          }
        }
      }
    }
  }
}

It runs successfully but throws a lot of warnings like:

WARN [task-runner-0-priority-0] org.apache.druid.indexing.input.DruidSegmentInputEntity - Could not clean temporary segment file: var/druid/task/index_parallel_my-data-source_maihjbio_2021-05-11T12:36:55.343Z/work/indexing-tmp/my-data-source/2020-01-28T00:00:00.000Z_2020-01-29T00:00:00.000Z/2020-01-28T15:41:40.130Z/1

It should filter out / delete 452,482 rows, but the count stays the same.
One line in the logs also suggests this:

INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.IndexTask - Processed[0] events, unparseable[0], thrownAway[452,482].

Do you have any idea why the filtered data is not dropped/deleted?

Please see the full log here:
https://drive.google.com/file/d/1H6bDZmbK3QqUWX2a5vmtmclpv_meJ6uE/view

I guess the INFO line is saying that the right number of rows were thrown away, but none were processed. Is that what you are seeing when you query the data? Sorry, just trying to get my head around the issue here.
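
One thing that might help narrow it down (just a sketch, and it assumes the SQL sys tables are enabled on your cluster) is to check whether the reindexing task actually published new segment versions for that interval:

SELECT "start", "end", "version", num_rows, is_published, is_overshadowed
FROM sys.segments
WHERE datasource = 'my-data-source'
ORDER BY "start", "version"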

Yes, when I’m querying the data with:

SELECT
  id, COUNT(*) AS "Count"
FROM "my-data-source"
WHERE id = 'application_01' AND __time <= TIMESTAMP '2020-02-01 00:00:00'
GROUP BY 1
ORDER BY 2 DESC

I get a count of 452,482 (before and after reindexing). The task is not able to process/filter/drop the data, and I don’t understand why.

If any further information would help figure out what the problem is, I’ll be happy to provide it. Thank you very much so far.