I’m trying to figure out what I’d need to do to remove certain subsets of data from Druid (0.12.1). For example, I’d like to be able to remove all data associated with a specific customer identifier. I’d need to run this task rarely (if ever), but for business reasons I need the ability to do it. Is this possible with Druid?
I’ve seen re-indexing tasks with Hadoop, but I’d like to avoid having to spin up and manage Hadoop, as we don’t currently use it. Another thought I had was to implement my own ingestion task that drops rows matching certain criteria, but I’m not sure that’s a good path to go down.
Has anyone done this before or is there something I missed in the docs/repo that could help me achieve this?
– Ryan Plessner
You could do the same thing with native index tasks using the “ingestSegment” firehose + a “filter” (one of the parameters of the ingestSegment firehose) that only retains the data you want to retain.
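To make that concrete, a reindexing spec along those lines might look like the sketch below. This is only an illustration: the dataSource name, interval, dimension and metric names, and the `customer_id` value are all placeholders, and the `parser` and `metricsSpec` should mirror whatever your existing datasource’s schema actually is. The key part is the `"not"` + `"selector"` filter on the ingestSegment firehose, which retains every row except those for the customer being removed:

```json
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "my_datasource",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "__time", "format": "auto" },
          "dimensionsSpec": { "dimensions": ["customer_id", "other_dim"] }
        }
      },
      "metricsSpec": [
        { "type": "longSum", "name": "count", "fieldName": "count" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "HOUR",
        "intervals": ["2018-06-01/2018-06-02"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "ingestSegment",
        "dataSource": "my_datasource",
        "interval": "2018-06-01/2018-06-02",
        "filter": {
          "type": "not",
          "field": {
            "type": "selector",
            "dimension": "customer_id",
            "value": "12345"
          }
        }
      }
    },
    "tuningConfig": { "type": "index" }
  }
}
```

You’d submit one such task per interval you need to rewrite; the task reads the existing segments for that interval, applies the filter, and publishes new segment versions that overshadow the old ones.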
Thanks for the response. I had seen this in the docs after posting this question, but the docs also say “Please use Hadoop batch ingestion for production scenarios dealing with more than 1GB of data” in reference to this task. One of the limiting factors of native index tasks appears to be that they can’t use more than one slot of middle manager capacity. So I had somewhat written this off as a solution, especially since my cluster ingests well over 1 GB per day and I’m hoping to store six months to a year of data. I’m already using native indexing tasks in the form of compaction tasks, and those take a while for just an hour of data. Am I understanding middle manager capacity correctly, or can a native index task use more than one slot? I suppose another option would be to identify and remove some dimensions that are seriously hurting my rollups, which would shrink the data enough to make native indexing tasks viable, but that’s a me problem.