The issue lies in the filter passed to IngestSegmentFirehose.
For context, I'm trying to delete data from within segments by reindexing them. To do that, I'm using an index task with IngestSegmentFirehose. My index task looks like this:
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "testData",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2016-01-01T13:00:00.000Z/2019-02-21T13:56:52.889Z"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "ingestSegment",
        "dataSource": "testData",
        "interval": "2016-01-01T13:00:00.000Z/2019-02-21T13:56:52.889Z",
        "filter": {
          "type": "not",
          "field": {
            "type": "and",
            "fields": [
              {
                "type": "selector",
                "dimension": "aId",
                "value": "71F9BCD4-3DDE-48BC-84DE-89D7DB987B4E"
              },
              {
                "type": "selector",
                "dimension": "bId",
                "value": "DAB5009E-BF0E-430E-BF75-CF2E3A4FD739"
              }
            ]
          }
        }
      }
    }
  }
}
The first time, this job works because the data I'm deleting is intermixed with other data in the segments, so the filter in IngestSegmentFirehose still has rows to pick up. However, when the (aId, bId) combination being deleted is the only one in the datasource, nothing is left for the firehose to reindex. Rather than dropping those segments, the task just returns logs like these:
2019-03-07T08:57:05,700 INFO [task-runner-0-priority-0] org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver - Dropping segments[[]]
2019-03-07T08:57:05,706 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.IndexTask - Pushed segments[[]]
2019-03-07T08:57:05,707 INFO [publish-0] org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver - Nothing to publish, skipping publish step.
2019-03-07T08:57:05,707 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.IndexTask - Processed[0] events, unparseable[0], thrownAway[0].
2019-03-07T08:57:05,708 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.IndexTask - Published segments[]
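To make the behaviour concrete, here is a minimal sketch in plain Python of how I understand the filter to evaluate. It is only a model of the not/and/selector logic from the spec, not Druid code; `keep_row` and `reindex` are hypothetical names, and the sample rows are made up for illustration.

```python
from typing import Dict, List

Row = Dict[str, str]

# Values copied from the filter in the task spec.
A_ID = "71F9BCD4-3DDE-48BC-84DE-89D7DB987B4E"
B_ID = "DAB5009E-BF0E-430E-BF75-CF2E3A4FD739"


def keep_row(row: Row) -> bool:
    """Model of not(and(selector aId, selector bId)).

    A row is kept unless it matches BOTH selectors.
    """
    return not (row.get("aId") == A_ID and row.get("bId") == B_ID)


def reindex(rows: List[Row]) -> List[Row]:
    """Return only the rows the filter lets through (hypothetical helper)."""
    return [row for row in rows if keep_row(row)]


# Made-up sample data: one segment with mixed rows, one with only the target pair.
mixed = [{"aId": A_ID, "bId": B_ID}, {"aId": "other-a", "bId": B_ID}]
only_target = [{"aId": A_ID, "bId": B_ID}, {"aId": A_ID, "bId": B_ID}]

print(len(reindex(mixed)))        # one row survives, so a new segment can be written
print(len(reindex(only_target)))  # no rows survive, so there is nothing to publish
```

In the second case the task sees zero events, which matches the Processed[0] and "Nothing to publish" lines in the logs above.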
Now, the question is: is there an attribute in IngestSegmentFirehose that makes it drop/delete/ignore the events selected by the filter?