Updating / replacing data that was loaded via the Kafka indexing task

I’m trying to understand the update use case a little better, and how we would update our data after it has been loaded into Druid.

Let’s say I have a Kafka topic with 5 partitions, and a Kafka indexing task (supervisor) that will create 15-minute segments.
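(For context, the kind of supervisor spec I have in mind is sketched below; the datasource, topic, and column names are just placeholders, and the exact spec layout varies by Druid version.)

```python
import requests

# Hypothetical Kafka supervisor spec that writes 15-minute segments.
# Datasource, topic, and column names are placeholders; this uses the
# newer inputFormat-style layout, which varies by Druid version.
supervisor_spec = {
    "type": "kafka",
    "dataSchema": {
        "dataSource": "events",
        "timestampSpec": {"column": "timestamp", "format": "iso"},
        "dimensionsSpec": {"dimensions": ["user_id", "status"]},
        "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "FIFTEEN_MINUTE",
            "queryGranularity": "none"
        }
    },
    "ioConfig": {
        "topic": "events",
        "inputFormat": {"type": "json"},
        "consumerProperties": {"bootstrap.servers": "kafka:9092"},
        "taskCount": 1,
        "taskDuration": "PT1H"
    },
    "tuningConfig": {"type": "kafka"}
}

# Submit to the Overlord's supervisor endpoint.
requests.post("http://overlord:8090/druid/indexer/v1/supervisor", json=supervisor_spec)
```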

Now let’s say we have streamed in and processed data for 2019-01-01.

One day later, on 2019-01-02, we determine that some records have changed; say we found 100 records at 2019-01-01T20:00:00 that need to be corrected.

So now I can identify the time chunk (I believe that is the term?), which would logically be 2019-01-01T20:00:00 up to (but not including) 2019-01-01T20:15:00.

Since we streamed it in via Kafka, does this mean we will in fact have up to 5 segments where this data could live? And our updated/new data may span 1 or more segment files in that time chunk.
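(To double-check which segment files are involved, I figure I could list the segments covering that time chunk with something like this sketch; the Broker host and datasource name are placeholders, and I’m assuming the sys.segments start/end values are ISO timestamp strings.)

```python
import requests

# Sketch: list the segments that cover the 15-minute time chunk using the
# sys.segments metadata table over the Broker's SQL API.
# Broker host/port and the datasource name are placeholders.
sql = """
SELECT "segment_id", "start", "end", "num_rows"
FROM sys.segments
WHERE "datasource" = 'events'
  AND "start" >= '2019-01-01T20:00:00.000Z'
  AND "end"   <= '2019-01-01T20:15:00.000Z'
"""
rows = requests.post("http://broker:8082/druid/v2/sql", json={"query": sql}).json()
for row in rows:
    print(row)
```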

So in this case…we could:

Option 1:

  1. Extract all the data for that time chunk from Druid with a query (sketched after this list),

  2. Merge it offline into a data file (with the updated/new records)

  3. Create a new ‘FileIndexing’ task and re-process

–> Would this now ‘replace’ the old 5 segments in that time chunk, or would we also then need to delete the old segments?
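(For step 1, I’m picturing something like this hypothetical extract over the SQL API; the host, datasource, and output path are placeholders.)

```python
import json
import requests

# Sketch of Option 1, step 1: pull the whole 15-minute window back out of Druid
# over SQL so it can be merged offline with the corrected records.
# Host, datasource, and output path are placeholders.
sql = """
SELECT *
FROM "events"
WHERE __time >= TIMESTAMP '2019-01-01 20:00:00'
  AND __time <  TIMESTAMP '2019-01-01 20:15:00'
"""
rows = requests.post("http://broker:8082/druid/v2/sql", json={"query": sql}).json()

# Dump to newline-delimited JSON, ready for the offline merge and re-index steps.
with open("/tmp/extract-2019-01-01T20.json", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```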

Option 2:

  1. Create a new data file (not via a Druid extract) with the correct data for that 15-minute window,

  2. Dump it to a file.

  3. Create a new ‘FileIndexing’ task and re-process

–> Same question as above: would this now ‘replace’ the old 5 segments in that time chunk, or would we also then need to delete the old segments?

Option 3:

  1. Create a NEW Kafka TOPIC, and dump our merged data into that topic.

  2. Drop/Kill the old time-chunk (the 5 segment files)

  3. Create a new Kafka indexing task (with the same datasource name)

Would all 3 of these options make sense, and are they even possible? (I suspect option 3 is maybe the most questionable?)

Dan

Hey Dan,

Option 2 is the most common approach I have seen in real-world situations. It takes advantage of the fact that Druid’s batch indexing defaults to ‘overwrite by interval’ mode, meaning you can use it to replace stream-ingested data.
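Roughly, the kind of re-index task I mean is sketched below. I’m assuming the newer inputSource/inputFormat style of spec (older versions use a firehose-based ioConfig), and the datasource, columns, and file paths are just placeholders.

```python
import requests

# Hedged sketch of a native batch task that overwrites the 15-minute time chunk.
# With appendToExisting=false and the interval pinned, the new segments get a
# higher version and overshadow the segments the Kafka tasks wrote there, so
# they effectively replace them for queries. (The old segment files stay in
# deep storage until they are killed.) Names and paths are placeholders.
reindex_spec = {
    "type": "index_parallel",
    "spec": {
        "dataSchema": {
            "dataSource": "events",  # same datasource the Kafka supervisor writes to
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["user_id", "status"]},
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "FIFTEEN_MINUTE",
                "queryGranularity": "none",
                "intervals": ["2019-01-01T20:00:00/2019-01-01T20:15:00"]
            }
        },
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {
                "type": "local",
                "baseDir": "/tmp",
                "filter": "fixed-2019-01-01T20.json"
            },
            "inputFormat": {"type": "json"},
            "appendToExisting": False  # overwrite-by-interval
        },
        "tuningConfig": {"type": "index_parallel"}
    }
}

# Submit the task to the Overlord.
requests.post("http://overlord:8090/druid/indexer/v1/task", json=reindex_spec)
```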

People do option 1 sometimes too, and it can also work - you just need to export the data first.

Option 3 wouldn’t work as written, since you can only have one topic reading into a datasource at a time. But you could tweak it slightly: first, drop the old time chunk; then, write the fixed/merged data into the original Kafka topic. That would work.
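For the “drop the old time chunk” step, the rough shape is sketched below: mark the interval’s segments unused via the Coordinator so they stop being served, then optionally submit a kill task to delete them from deep storage. The endpoints shown are from recent Druid versions and may differ in older releases; the hosts and datasource name are placeholders.

```python
import requests

# Sketch of dropping the old 15-minute time chunk before re-streaming the
# corrected data. Hosts, datasource name, and exact endpoints are
# placeholders / version-dependent.
interval = "2019-01-01T20:00:00/2019-01-01T20:15:00"

# 1) Mark the interval's segments unused so Historicals stop serving them.
requests.post(
    "http://coordinator:8081/druid/coordinator/v1/datasources/events/markUnused",
    json={"interval": interval},
)

# 2) Optionally, permanently delete the now-unused segments with a kill task.
kill_task = {"type": "kill", "dataSource": "events", "interval": interval}
requests.post("http://overlord:8090/druid/indexer/v1/task", json=kill_task)
```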