Backfilling a realtime Kafka ingestion with archived data from another ingestion

I have a realtime Kafka ingestion, let’s call it OLD. It’s working fine and nothing is wrong with it.

I want to create a second Kafka ingestion, let’s name it NEW. It has different params.

I have started NEW, and it is working. It has data from 2022, but not from 2021.

My Question:

Now, if I run Druid native re-indexing from datasource OLD to NEW, will it add the 2021 data to NEW, alongside what the realtime ingestion has already written, without hurting any data in OLD? In other words, will it look as if datasource NEW had ingested that data itself? Is that possible?

Relates to Apache Druid 0.20.1

I have not actually tried this out myself, and even if it does work, it is risky. You would have to be very careful about the intervals you use in your re-indexing and make sure that no events for those intervals arrive while the re-indexing is running.
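To make the interval check concrete, here is a minimal sketch in Python that tests whether the interval you plan to re-index overlaps the interval NEW is already ingesting. The function names are hypothetical helpers, not part of Druid, and the sketch assumes intervals are written as `start/end` ISO-8601 strings with an exclusive end.

```python
from datetime import datetime


def parse_interval(interval):
    """Split an ISO-8601 'start/end' string into two datetimes."""
    start, end = interval.split("/")
    return datetime.fromisoformat(start), datetime.fromisoformat(end)


def intervals_overlap(a, b):
    """Return True if two half-open intervals [start, end) share any instant."""
    a_start, a_end = parse_interval(a)
    b_start, b_end = parse_interval(b)
    return a_start < b_end and b_start < a_end


if __name__ == "__main__":
    reindex_interval = "2021-01-01/2022-01-01"   # data to copy from OLD
    realtime_interval = "2022-01-01/2023-01-01"  # data NEW is already ingesting
    print(intervals_overlap(reindex_interval, realtime_interval))
```

If the check reports an overlap, narrow the re-indexing interval before submitting anything.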

If it does work, your original data in OLD will not be affected.

I would test this thoroughly to verify it. I would also suggest creating a new datasource with the reindexed data from both OLD and NEW instead of modifying NEW.

I assume that by “native re-indexing” you mean Druid native batch ingestion. If you ingest the old datasource into the new datasource using native batch ingestion, and the intervals that contain data in old never overlap those in new (which seems true in your case), then batch ingestion from old to new will work as you expect. The old datasource is only read, so it is not changed.

If some intervals overlap, then you need to decide whether the data from old should replace the data in new or not. Batch ingestion has a parameter in the ioConfig, appendToExisting, which defaults to false. That means that by default batch ingestion will overwrite the data that already exists in new for the overlapping intervals. It can get complicated depending on the old and new segment granularities and details like that, so make sure you do some tests before attempting the real thing. Good luck!
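For reference, a rough sketch of what such a task spec could look like, built as a Python dict so it can be printed as JSON. It assumes the `druid` input source and the `index_parallel` task type. The helper name, the 2021 interval, the DAY segment granularity, and the empty dimension and metric lists are placeholders, not values from this thread, so fill them in to match NEW's schema.

```python
import json


def build_reindex_spec(source, target, interval, append=False):
    """Build a native batch task spec that reads `source` and writes `target`.

    Hypothetical helper: dimensions, metrics and granularities are placeholders.
    """
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": target,
                # Reuse the timestamp Druid already stored for each row.
                "timestampSpec": {"column": "__time", "format": "millis"},
                "dimensionsSpec": {"dimensions": []},  # placeholder: list NEW's dimensions
                "metricsSpec": [],                     # placeholder: list NEW's metrics
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "DAY",       # assumption: match NEW's granularity
                    "queryGranularity": "NONE",
                    "intervals": [interval],           # limits which intervals are written
                },
            },
            "ioConfig": {
                "type": "index_parallel",
                "inputSource": {
                    "type": "druid",
                    "dataSource": source,
                    "interval": interval,
                },
                # False (the default) overwrites existing data in the target intervals.
                "appendToExisting": append,
            },
            "tuningConfig": {"type": "index_parallel"},
        },
    }


if __name__ == "__main__":
    spec = build_reindex_spec("OLD", "NEW", "2021-01-01/2022-01-01")
    print(json.dumps(spec, indent=2))
```

The printed JSON would then be submitted as a task to the Overlord (for example, a POST to /druid/indexer/v1/task). Try it against a throwaway datasource first.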