I’m trying to understand a bit more about the update use case, and how we would update our data after it has been loaded into Druid.
Let’s say I have a Kafka topic with 5 partitions, and a Kafka indexing task that creates 15-minute segments.
Now let’s say we have streamed in and processed data for 2019-01-01.
One day later, on 2019-01-02, we determine that some records have changed; say we found 100 records at 2019-01-01T20:00:00.
So now I can identify the time chunk (I believe that is the term?), which would logically be 2019-01-01T20:00:00 up to (but not including) 2019-01-01T20:15:00.
Since we streamed it in via Kafka, does that mean we will in fact have 5 segments where this data could live, and our updated/new data may sit in 1 or more of the segment files in that time chunk?
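For context, this is roughly how I would try to confirm which segments make up that time chunk, going through the Druid SQL sys.segments metadata table on the broker (the broker URL and datasource name below are just placeholders for ours):

```python
import requests

# Placeholder broker endpoint; adjust host/port for your cluster.
DRUID_SQL_URL = "http://localhost:8082/druid/v2/sql"

# List the segments that make up the 2019-01-01T20:00/20:15 time chunk.
# "end" is quoted because it is a reserved word in Druid SQL.
query = """
SELECT "segment_id", "start", "end", "partition_num", "num_rows", "size"
FROM sys.segments
WHERE "datasource" = 'my_datasource'
  AND "start" >= '2019-01-01T20:00:00.000Z'
  AND "end" <= '2019-01-01T20:15:00.000Z'
"""

resp = requests.post(DRUID_SQL_URL, json={"query": query})
resp.raise_for_status()
for segment in resp.json():
    print(segment)
```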
So in this case, we could do one of the following:
Option 1:
- Extract all of the data for that time chunk from Druid with a query (a rough sketch of this step follows this list).
- Merge it offline into a data file (with the updated/new records).
- Create a new file-indexing task and re-process.
–> Would this now ‘replace’ the old 5 segments at that time chunk, or would we also need to delete the old segments?
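For the extraction step in option 1, I am picturing something like the following: pull every row in the 15-minute window through the SQL endpoint and dump it to newline-delimited JSON for the offline merge (broker URL, datasource name, and output path are placeholders, and I’m assuming a single 15-minute chunk fits comfortably in memory):

```python
import json
import requests

DRUID_SQL_URL = "http://localhost:8082/druid/v2/sql"  # placeholder broker endpoint

# Pull every row in the 15-minute time chunk so it can be merged offline
# with the corrected records.
query = """
SELECT *
FROM "my_datasource"
WHERE __time >= TIMESTAMP '2019-01-01 20:00:00'
  AND __time <  TIMESTAMP '2019-01-01 20:15:00'
"""

resp = requests.post(DRUID_SQL_URL, json={"query": query})
resp.raise_for_status()

# Write the extracted rows out as newline-delimited JSON so they can be
# merged with the 100 corrected records and re-indexed later.
with open("/data/corrections/2019-01-01T20.json", "w") as f:
    for row in resp.json():
        f.write(json.dumps(row) + "\n")
```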
Option 2:
- Create a new data file (not via a Druid extract) with the proper data for that 15-minute window.
- Dump it to a file.
- Create a new file-indexing task and re-process (a sketch of this submission follows this list).
–> Same question as above: would this now ‘replace’ the old 5 segments at that time chunk, or would we also need to delete the old segments?
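For the ‘re-process’ step in options 1 and 2, my rough idea of the file-indexing task is a native batch spec pinned to that one interval, with appendToExisting set to false, submitted to the overlord. This is only a sketch: the datasource name, schema, and file locations are placeholders, and I may well have parts of the spec wrong:

```python
import requests

OVERLORD_URL = "http://localhost:8090/druid/indexer/v1/task"  # placeholder overlord endpoint

# Rough native batch ('index_parallel') spec; datasource name, schema,
# and file locations are placeholders.
spec = {
    "type": "index_parallel",
    "spec": {
        "dataSchema": {
            "dataSource": "my_datasource",
            "parser": {
                "type": "string",
                "parseSpec": {
                    "format": "json",
                    "timestampSpec": {"column": "timestamp", "format": "iso"},
                    "dimensionsSpec": {"dimensions": ["user_id", "country"]},
                },
            },
            "metricsSpec": [],
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "FIFTEEN_MINUTE",
                "queryGranularity": "none",
                "rollup": False,
                # Only this 15-minute interval should be rewritten.
                "intervals": ["2019-01-01T20:00:00/2019-01-01T20:15:00"],
            },
        },
        "ioConfig": {
            "type": "index_parallel",
            "firehose": {
                "type": "local",
                "baseDir": "/data/corrections",
                "filter": "2019-01-01T20.json",
            },
            # False = write new segments for the interval rather than
            # appending extra partitions next to the existing ones.
            "appendToExisting": False,
        },
        "tuningConfig": {"type": "index_parallel"},
    },
}

resp = requests.post(OVERLORD_URL, json=spec)
resp.raise_for_status()
print("Submitted task:", resp.json()["task"])
```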
Option 3:
- Create a NEW Kafka topic and dump our merged data into that topic.
- Drop/kill the old time chunk (the 5 segment files).
- Create a new Kafka indexing task with the same datasource name (a rough sketch of these last two steps follows this list).
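For option 3, the drop/kill plus new-Kafka-task steps are the parts I am least sure about. My rough idea looks like the sketch below; the coordinator/overlord URLs, topic name, schema, and datasource name are placeholders, and I am assuming the coordinator’s DELETE-on-interval API and the Kafka supervisor endpoint are the right tools here:

```python
import requests

COORDINATOR_URL = "http://localhost:8081"  # placeholder coordinator endpoint
OVERLORD_URL = "http://localhost:8090"     # placeholder overlord endpoint
DATASOURCE = "my_datasource"

# The coordinator interval APIs take the interval with '_' instead of '/'.
INTERVAL = "2019-01-01T20:00:00.000Z_2019-01-01T20:15:00.000Z"

# Step 1: ask the coordinator to drop the old time chunk (the 5 segment files).
drop = requests.delete(
    f"{COORDINATOR_URL}/druid/coordinator/v1/datasources/{DATASOURCE}/intervals/{INTERVAL}"
)
drop.raise_for_status()

# Step 2: start a Kafka supervisor reading the corrected topic into the
# same datasource (schema trimmed down; topic and brokers are placeholders).
supervisor_spec = {
    "type": "kafka",
    "dataSchema": {
        "dataSource": DATASOURCE,
        "parser": {
            "type": "string",
            "parseSpec": {
                "format": "json",
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {"dimensions": ["user_id", "country"]},
            },
        },
        "metricsSpec": [],
        "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "FIFTEEN_MINUTE",
            "queryGranularity": "none",
        },
    },
    "ioConfig": {
        "topic": "my_datasource_corrections",
        "consumerProperties": {"bootstrap.servers": "kafka:9092"},
        "useEarliestOffset": True,
    },
    "tuningConfig": {"type": "kafka"},
}

resp = requests.post(f"{OVERLORD_URL}/druid/indexer/v1/supervisor", json=supervisor_spec)
resp.raise_for_status()
print("Supervisor started:", resp.json())
```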
Would all 3 of these options make sense, and are they even possible? (I suspect option 3 is maybe the most questionable?)
Dan