Deletion from deep storage after reindexing

Hi,

I am trying to figure out how to selectively delete data from Druid. As per some recommendations on the forum, I am trying to reindex with the ingestSegment firehose, reading data from the existing segments and keeping only the rows I want to retain by applying a NOT filter. I haven't had any success with that yet.
Link for the thread: https://groups.google.com/forum/#!topic/druid-user/kTIxW5_-1og

What I wanted to understand better is what happens to the existing segments after a reindex task succeeds. For example, suppose we reindex existing data and choose NOT to include rows where dimensionA="xyz"; the reindexed data should then contain no records where dimensionA="xyz". In that case, what happens to the segments holding the dropped data? How would they get purged from deep storage? I found no mention of needing to run load/drop rules and kill tasks after reindexing with ingestSegment.
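For concreteness, the kind of NOT filter I mean looks like this (dimensionA and "xyz" are just the placeholder names from the example above):

```json
{
  "type": "not",
  "field": {
    "type": "selector",
    "dimension": "dimensionA",
    "value": "xyz"
  }
}
```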

Thanks,

Prathamesh

Can you share the exact ingestion spec you’re using?
I’m particularly curious about the granularitySpec and the firehose section.

–dave

Hi Dave,

I am sharing the original ingestion spec (attached: tranquilityConfig.json), which is used by my Java application to ingest data into Druid, and the ingestion spec for the indexing task which I used to reindex the data (reindex.json), which I POST to http://localhost:8090/druid/indexer/v1/task:

```json
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "sampledatasource",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {
            "column": "time",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": ["url", "user", "os"],
            "dimensionExclusions": ["time"]
          }
        }
      },
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "minute",
        "queryGranularity": "minute",
        "rollup": false,
        "intervals": ["2019-02-27T00:00:00.000Z/2019-03-27T00:00:00.000Z"]
      },
      "metricsSpec": []
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "ingestSegment",
        "dataSource": "sampledatasource",
        "interval": "2019-02-27T00:00:00.000Z/2019-03-27T00:00:00.000Z",
        "filter": { "type": "selector", "dimension": "user", "value": "snowy" }
      }
    }
  }
}
```

I checked the report for this indexing task. It shows:

```json
{
  "ingestionStatsAndErrors": {
    "taskId": "index_sampledatasource_2019-02-28T12:18:57.814Z",
    "payload": {
      "ingestionState": "COMPLETED",
      "unparseableEvents": {},
      "rowStats": {
        "determinePartitions": {
          "processed": 0,
          "processedWithError": 0,
          "thrownAway": 0,
          "unparseable": 0
        },
        "buildSegments": {
          "processed": 10,
          "processedWithError": 0,
          "thrownAway": 0,
          "unparseable": 0
        }
      },
      "errorMsg": null
    },
    "type": "ingestionStatsAndErrors"
  }
}
```

The line:

`"processed": 10`

indicates that the task reads the correct number of records from the existing segments, ignoring the ones I chose not to include. But after the task completes, "sampledatasource" still contains the events that should have been dropped by the filter. So if I have 25 events and want to drop 15 of them per the specified filter, after the task completes I still find all 25 events in Druid. The segments are present in the segment cache. I see the following lines in the logs for this task:
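(For reference, a segmentMetadata query like the following, sent to the broker, should list the segments still being served; this is just a sketch using the interval from my reindex spec:)

```json
{
  "queryType": "segmentMetadata",
  "dataSource": "sampledatasource",
  "intervals": ["2019-02-27T00:00:00.000Z/2019-03-27T00:00:00.000Z"]
}
```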

```
2019-02-28T12:20:39,319 INFO [SimpleDataSegmentChangeHandler-0] org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager - Deleting directory[var/druid/segment-cache/sampledatasource/2019-02-28T11:34:00.000Z_2019-02-28T11:35:00.000Z/2019-02-28T11:34:14.507Z/0]
2019-02-28T12:20:39,332 INFO [SimpleDataSegmentChangeHandler-0] org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager - Deleting directory[var/druid/segment-cache/sampledatasource/2019-02-28T11:34:00.000Z_2019-02-28T11:35:00.000Z/2019-02-28T11:34:14.507Z]
```

But the segments are not deleted.

Thanks,

Prathamesh

Attachment: tranquilityConfig.json (1.64 KB)

Hi Dave,

Any idea what could cause Druid to fail to delete the segments (from both deep storage and the segment cache) that were excluded during reindexing?

Thanks,

Prathamesh

Hi Prathamesh,

I think I figured out the issue. It may be due to the way you are ingesting the data: check whether your ingestion mechanism provides guaranteed exactly-once ingestion. What may be happening is that you ingest the data, then delete the particular records, but the indexing service isn't aware of the changes, so it ingests the data again, and that may be why you still see the data at query time. I'm using the Kafka indexing service, which provides guaranteed exactly-once ingestion, so even after I purge records they aren't re-ingested.
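For reference, a minimal Kafka supervisor spec (submitted to the overlord at /druid/indexer/v1/supervisor) looks roughly like this; it's only a sketch, and the topic name, broker address, and parser details are placeholders you'd replace with your own:

```json
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "sampledatasource",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": { "column": "time", "format": "auto" },
        "dimensionsSpec": { "dimensions": ["url", "user", "os"] }
      }
    },
    "granularitySpec": { "type": "uniform", "segmentGranularity": "HOUR", "queryGranularity": "NONE" },
    "metricsSpec": []
  },
  "ioConfig": {
    "topic": "sample-topic",
    "consumerProperties": { "bootstrap.servers": "localhost:9092" }
  }
}
```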

Try using a different ingestion mechanism and let me know.

Sorry, I’m not actually familiar with Tranquility.

Hi Swapneel,

Are you saying that reindexing into the same datasource works only if exactly-once ingestion is guaranteed?

"What may be happening is that you ingest the data, then delete the particular records, but the indexing service isn't aware of the changes, so it ingests the data again, and that may be why you still see the data at query time." – I ingest data in real time using Tranquility, then I run the reindex task applying a NOT filter to drop certain events. After the reindex task completes, I expect only the records not dropped by the filter to be present in the datasource. I never re-ingest the data I am expecting the reindex task to drop.

Now I am not sure how exactly the existing data in the datasource is supposed to be dropped. When using load/drop rules, one has to run a kill task (sketched below) to purge the dropped segments from deep storage. That is why I was wondering whether this ingestSegment firehose can actually be used to selectively drop data from a given datasource. I do see log lines indicating an attempt to drop certain segments from the segment cache, but the data is never dropped.
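My understanding is that a kill task is just another task posted to the overlord at /druid/indexer/v1/task, something like the sketch below (the interval is the one from my reindex spec), and that it only deletes segments that have already been marked unused:

```json
{
  "type": "kill",
  "dataSource": "sampledatasource",
  "interval": "2019-02-27T00:00:00.000Z/2019-03-27T00:00:00.000Z"
}
```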

Certainly the load/drop rules can't be used for selectively dropping data, since rules address whole intervals of segments rather than individual rows (see the sketch below). Deletion seems complicated in Druid.
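For example, as far as I can tell a drop rule can only name an interval, not a filter on dimension values, something like this sketch:

```json
{
  "type": "dropByInterval",
  "interval": "2019-02-27T00:00:00.000Z/2019-03-27T00:00:00.000Z"
}
```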

Thanks,

Prathamesh