We are using the Kafka indexing service to ingest our data into Druid. The data from this topic is also stored on a distributed filesystem in Avro files. In some rare cases we need to modify this data and re-ingest it with a batch job. I have some questions regarding the steps to follow:
1. Should I stop the Kafka indexing task?
2. Should I manually remove the old segments, or is it enough just to disable them?
3. If I don't remove them beforehand, will the Spark ingestion job replace them or merge them with the new data?
For #1 - Nothing wrong with this model. I've seen folks use Kafka indexing when they want data in real time and re-ingest later. At Flurry, we have a separate data source for real time with a 2-day retention and a separate data source for batch ingestion.
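As a rough sketch of that two-datasource model, a backend could route each query by the age of the interval it covers. The datasource names and the 2-day cutoff below are placeholders matching the retention mentioned above, not anything Druid provides out of the box:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical routing for the split-datasource model: recent intervals are
# served from a realtime datasource fed by the Kafka indexing service, older
# intervals from the batch-ingested datasource. Names and cutoff are assumed.
REALTIME_DATASOURCE = "events_realtime"  # short retention, e.g. 2 days
BATCH_DATASOURCE = "events_batch"        # full history, batch-ingested

def pick_datasource(interval_start: datetime, now: Optional[datetime] = None) -> str:
    """Route a query to the realtime or batch datasource by interval age."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=2)
    return REALTIME_DATASOURCE if interval_start >= cutoff else BATCH_DATASOURCE
```

In practice the cutoff should match the realtime datasource's retention rule, so that any interval the realtime datasource has already dropped is served from batch.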
#2 - You don't have to manually remove old/unused segments. There is a coordinator dynamic config setting called killDataSourceWhitelist (editable via the coordinator console). Adding your data source there lets the coordinator delete unused segments from the metadata store and deep storage.
#3 - Any time you run an ingestion for a specific interval, the new segments will replace the old segments for that interval.
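To illustrate, the interval being overwritten is pinned in the granularitySpec of the batch ingestion spec. This is only a sketch with placeholder names, paths, and dates; segments produced for the listed interval shadow the existing ones at that interval:

```json
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "my_datasource",
      "granularitySpec": {
        "segmentGranularity": "DAY",
        "intervals": ["2018-01-01/2018-01-08"]
      }
    }
  }
}
```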
BTW, here's the link for the killDataSourceWhitelist setting: www.druid.io/docs/latest/configuration/coordinator.html
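For reference, the coordinator dynamic config might look like the sketch below (the datasource name is a placeholder). Note that, as I understand it, the coordinator also needs kill tasks enabled (e.g. druid.coordinator.kill.on=true in its runtime properties) for the whitelist to have any effect:

```json
{
  "killDataSourceWhitelist": ["my_datasource"],
  "killAllDataSources": false
}
```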
Thanks for your answers. In your case, is it your backend services that decide how to dispatch a query between the real-time and the batch datasource?
I think my first question might not have been clear. I wanted to know whether I have to temporarily stop the Kafka indexing job during the batch job for the same datasource, or they might turn