Modify input data & replace segments

Hi,

We are using the Kafka indexing service to ingest our data into Druid. The data from this topic is also stored on a distributed filesystem in Avro files. In some rare cases we need to modify this data and re-ingest it with a batch job. I have a few questions about the steps to follow:

1, Should I stop the Kafka indexing task?

2, Should I manually remove the old segments or is it enough just to disable them?

3, If I don't remove them beforehand, will the Spark ingestion job simply replace them, or will it merge them with the new data?

Best regards,

Balazs

For #1 - Nothing wrong with this model. I've seen folks use Kafka indexing when they want data in real time and re-ingest it later. At Flurry, we have a separate data source for real-time data with a 2-day retention and a separate data source for batch ingestion.
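One option, if your backend would rather issue a single query, is to let Druid merge the two at query time with a union datasource. A rough sketch against the broker's native query endpoint; the datasource names, broker host/port and the interval are made up for illustration, and both datasources need compatible schemas:

    # Query a realtime and a batch datasource together via a union datasource.
    # Datasource names, broker host/port and the interval are placeholders.
    import requests

    query = {
        "queryType": "timeseries",
        "dataSource": {
            "type": "union",
            "dataSources": ["events_realtime", "events_batch"],  # hypothetical names
        },
        "granularity": "hour",
        "intervals": ["2016-06-01/2016-06-02"],
        "aggregations": [{"type": "count", "name": "rows"}],
    }

    # The broker serves native queries at /druid/v2 (default port 8082).
    resp = requests.post("http://broker:8082/druid/v2", json=query)
    print(resp.json())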

#2 - You don't have to manually remove old/unused segments. There is a setting called killDataSourceWhitelist in the coordinator UI. Adding your data source there should delete unused segments from the metadata store and deep storage.
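In case you'd rather script it than click through the UI, the same setting lives in the coordinator's dynamic configuration, which you can read and update over HTTP. A rough sketch, with the coordinator host and datasource name as placeholders; if I remember right, the coordinator also needs druid.coordinator.kill.on=true for kill tasks to actually be submitted, and only segments already marked unused get cleaned up:

    # Add a datasource to killDataSourceWhitelist via the coordinator's dynamic
    # config endpoint instead of the UI. Host/port and datasource name are
    # placeholders.
    import requests

    coordinator = "http://coordinator:8081"

    # Fetch the current dynamic config so other settings are preserved.
    config = requests.get(coordinator + "/druid/coordinator/v1/config").json()

    # Depending on the Druid version this field may be a list or a
    # comma-separated string, so normalize it before adding the datasource.
    whitelist = config.get("killDataSourceWhitelist") or []
    if isinstance(whitelist, str):
        whitelist = [s for s in whitelist.split(",") if s]
    config["killDataSourceWhitelist"] = sorted(set(whitelist) | {"events_batch"})

    resp = requests.post(coordinator + "/druid/coordinator/v1/config", json=config)
    resp.raise_for_status()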

#3 - Any time you run an ingestion for a specific interval, the new segments will override/replace the old segments for that interval.
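If you want to convince yourself of that, one way is to list the segments for the datasource after the batch job finishes: the re-ingested segments carry a newer version timestamp for the interval, and the older versions are overshadowed and eventually marked unused. A quick sketch; coordinator host/port and datasource name are placeholders:

    # List the used segments of a datasource and print interval + version,
    # which shows the newer (replacing) segments overshadowing the old ones.
    import requests

    coordinator = "http://coordinator:8081"
    datasource = "events_batch"  # hypothetical

    url = "{}/druid/coordinator/v1/datasources/{}/segments?full".format(
        coordinator, datasource)
    for seg in requests.get(url).json():
        print(seg["interval"], seg["version"])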

btw - here's the link for the killDataSourceWhitelist setting: www.druid.io/docs/latest/configuration/coordinator.html

Hi Praveen,

Thanks for your answers. In your case, is it your backend services that decide how to dispatch a query between the real-time and the batch datasource?

I think my first question might not have been clear. I wanted to know whether I have to temporarily stop the Kafka indexing job during the batch job for the same datasource, or whether they can run in parallel.