How To Handle Replaying Data with Duplicates

  • Druid Version: 0.22.1
  • Kafka Ingestion (idempotent producer)

I am looking for advice on how to handle a specific scenario. We have extractors that pull data from a specific source, perform transformation/validation, and finally send it to a service that commits the data to Kafka for ingestion by Druid. Due to infrastructure issues, we sometimes have outages at the Kafka producer service.

Our service that sends the data to the Kafka producer archives a message whenever the producer does not acknowledge it (there is an HTTP server in front of the Kafka producer). I noticed that sometimes (when there are infrastructure or network-related issues), the producer still commits the incoming messages but responds to our sending service with a 50X error. The sending service then archives these messages for resending at a later time.

Since some of these messages were already committed, replaying/resending the archived messages results in duplicate rows in Druid (as shown in my screenshot).


The duplicate has exactly the same timestamp/__time value, and all other columns match as well.

I am looking for advice on how to do/handle the following:

  1. How do we clean up duplicate rows like this? Are there also any settings/options to prevent this before it happens?
  2. We process/send to Kafka in batches. Is there a way to do a quick check before replaying/resending the archived data so that we can remove the data that already exists in Druid?
  3. Or perhaps there is a better way to replay this data instead of going through Kafka again? I am open to any advice/thoughts.

Would appreciate any help/advice. Thanks!

Hi @Peter_Chang,

Welcome! Perfect rollup might be a good place to start to answer your first question. Here’s a blog post with a detailed explanation, and a tutorial if you’d like to give it a try.
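To make the idea concrete, here is a minimal sketch in plain Python of what rollup does conceptually: rows that share the same timestamp and dimension values are collapsed into a single row. It is only an illustration, not Druid code. The `rollup` function and the `device`/`status` columns are made up for the example. Two caveats worth keeping in mind: rollup during streaming (Kafka) ingestion is best-effort, so getting fully rolled-up segments generally means compacting or reindexing afterwards, and additive metrics such as sums or counts would still include the duplicate's contribution after the rows are merged.

```python
def rollup(rows, dimensions, time_column="__time"):
    """Collapse rows with identical timestamp and dimension values.

    Returns one row per unique (timestamp, dimensions) combination,
    with a "count" field recording how many input rows were merged.
    """
    grouped = {}
    for row in rows:
        key = (row[time_column],) + tuple(row[d] for d in dimensions)
        if key in grouped:
            grouped[key]["count"] += 1
        else:
            merged = {time_column: row[time_column]}
            merged.update({d: row[d] for d in dimensions})
            merged["count"] = 1
            grouped[key] = merged
    return list(grouped.values())


# Hypothetical rows; the second one is a replayed duplicate of the first.
rows = [
    {"__time": "2022-03-01T10:00:00Z", "device": "a1", "status": "ok"},
    {"__time": "2022-03-01T10:00:00Z", "device": "a1", "status": "ok"},
    {"__time": "2022-03-01T10:00:05Z", "device": "b2", "status": "ok"},
]
rolled = rollup(rows, ["device", "status"])
```

In this sketch the exact duplicate disappears as a separate row, but the `count` of 2 shows that it was still ingested, which is why rollup alone may not be enough if your metrics are additive.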

I also came across this discussion from a while ago, and it may overlap somewhat with your current scenario.
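On your second question, one possible approach (a sketch only, I have not run it against your setup) is to fetch the rows Druid already holds for the time range of an archived batch, for example with a query against the Druid SQL endpoint, and drop the archived messages whose key columns already appear there. The helper names `row_key` and `drop_already_ingested` and the sample columns below are hypothetical, and the sketch assumes the timestamps from both sides have been normalised to the same string format before comparing.

```python
def row_key(row, key_columns):
    """Build a hashable key from the columns that identify a row."""
    return tuple(row.get(column) for column in key_columns)


def drop_already_ingested(archived, existing, key_columns):
    """Return archived messages whose key is not present in `existing`.

    `existing` stands in for rows fetched from Druid for the same time
    range. Duplicates inside the archive itself are dropped as well.
    """
    seen = {row_key(row, key_columns) for row in existing}
    pending = []
    for message in archived:
        key = row_key(message, key_columns)
        if key not in seen:
            pending.append(message)
            seen.add(key)
    return pending


# Hypothetical data: one archived message was in fact already committed.
existing_rows = [
    {"__time": "2022-03-01T10:00:00Z", "device": "a1", "status": "ok"},
]
archived_batch = [
    {"__time": "2022-03-01T10:00:00Z", "device": "a1", "status": "ok"},
    {"__time": "2022-03-01T10:00:05Z", "device": "b2", "status": "ok"},
    {"__time": "2022-03-01T10:00:05Z", "device": "b2", "status": "ok"},
]
to_resend = drop_already_ingested(
    archived_batch, existing_rows, ["__time", "device", "status"]
)
```

Only the `b2` message would be resent here, and only once. Note that this check is not atomic: rows that are still in flight when you query would be missed, so it reduces duplicates rather than ruling them out.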

Let us know how it goes.

Best,

Mark