Steps to Modify an Existing Record

We have a bunch of records coming in through Kafka. Let's say we have a wrong value in exactly one record, in one column/field. What are the steps required to make that change?

From what I read (almost every single page on druid.io), we need to:

  1. prepare this row with the corrected value in an input format (either as CSV, or as a Kafka message)

  2. use the “Index Task” to load it in

But I don’t understand whether we also need to do the following:

  • delete old data?

  • merge into the old segment?

  • query the old record first to find the old segment?

It’s quite difficult to understand the steps that need to be taken to update a record. If someone can list out the steps required, and how, that’d help a lot.

A related question: what if we are not trying to merge or overwrite an existing record, but a new record just happens to have the exact same timestamp as one of our older records? What would happen in Druid? Would Druid replace it, or simply add the record without problems?

Please help,

Geoff

Hey Geoff,

The general idea is to go back to your raw data and regenerate the segment that contains the event. Batch ingestion is a good way to do this, since it provides atomic replacement of segments on a by-interval basis.

So the steps are:

  1. Have a copy of your raw data in something like S3 or HDFS, partitioned by time. Secor or Camus (or maybe Gobblin?) can do this.

  2. Fix your raw data.

  3. Run a Druid batch indexing job with the appropriate interval, on the appropriate raw data (all data for that interval).
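To make step 3 more concrete, here is a rough sketch of what a Hadoop batch reindexing task spec might look like, built in Python. The datasource name, interval, and S3 path are all hypothetical placeholders, and the exact spec fields can vary by Druid version, so treat this as a shape to adapt rather than a drop-in config:

```python
import json

def build_reindex_task(datasource, interval, input_paths):
    """Sketch of a Druid Hadoop batch indexing task spec for reindexing
    one interval. Field names follow the general task-spec layout; check
    your Druid version's docs for the exact schema."""
    return {
        "type": "index_hadoop",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "DAY",
                    # Only segments covering these intervals are replaced,
                    # so the input must contain ALL raw rows for the interval,
                    # not just the one corrected row.
                    "intervals": [interval],
                },
            },
            "ioConfig": {
                "type": "hadoop",
                "inputSpec": {"type": "static", "paths": input_paths},
            },
        },
    }

task = build_reindex_task(
    "events",                                  # hypothetical datasource
    "2016-03-01/2016-03-02",                   # the day containing the bad row
    "s3://my-bucket/raw/events/2016-03-01/*",  # hypothetical raw-data path
)
print(json.dumps(task, indent=2))
```

The key point is the `intervals` field: it is what scopes the atomic replacement, which is why the input has to be the full raw data for that interval.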

For #3 you can use the IndexTask for small amounts of data (under a GB or so) or Hadoop indexing for any amount of data. When the job in #3 completes, the new fixed segments will atomically replace the old segments.
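If it helps to see the submission side: either kind of task spec is POSTed to the Overlord's task endpoint (`/druid/indexer/v1/task`). A minimal sketch in Python, where the Overlord address is a placeholder for your deployment:

```python
import json
import urllib.request

OVERLORD = "http://overlord.example.com:8090"  # hypothetical Overlord address

def build_request(task_spec):
    """Build the HTTP request that posts a task spec to the Overlord."""
    return urllib.request.Request(
        OVERLORD + "/druid/indexer/v1/task",
        data=json.dumps(task_spec).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def submit_task(task_spec):
    """POST the task; the Overlord replies with the assigned task id."""
    with urllib.request.urlopen(build_request(task_spec)) as resp:
        return json.load(resp)["task"]
```

You can then poll the task's status from the Overlord console (or its status API) until it succeeds, at which point the replacement segments become queryable.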