Remove duplicate rows in realtime ingestion

Hi All,

We are working on loading realtime data feeds into Druid using the Kafka indexing service. There are cases where the source system sends duplicate rows. Is there a way to avoid loading duplicate rows into the Druid datasource?

Thanks

Soumya

Soumya,

KIS provides “exactly once” semantics, which means it guarantees that no duplicates will be processed.

Rommel Garcia

The Kafka indexing service’s exactly-once semantics mean that it will process each message exactly once. If the writer produces duplicates, it is sending multiple distinct messages, so those semantics won’t be of any help.

You’d need to come up with a different solution for de-duplication, or consider a “lambda architecture” where real-time data is periodically replaced by more accurate data produced from a batch process.

Best regards,

Dylan

Thank you for the response. The issue is that the duplicates are already in the Kafka topic itself.

Hey Soumya,

You might be able to use Kafka idempotent producer configs to dedupe on the producer side: https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/. If that won’t work for some reason, perhaps consider a stream processing job that dedupes.
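In case it helps, here is a minimal sketch of the stream-processing approach using Kafka Streams. It assumes each row carries a message key that uniquely identifies it, and it uses hypothetical topic names (raw-events as the source feed, deduped-events as the topic the Kafka indexing service would consume). It keeps a state store of keys already seen and forwards only the first occurrence of each key:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class DedupStreamJob {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "dedup-job");          // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // hypothetical broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Persistent store remembering keys that have already been forwarded.
        builder.addStateStore(Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("seen-keys"),
                Serdes.String(), Serdes.String()));

        builder.<String, String>stream("raw-events")     // hypothetical input topic with duplicates
               .transform(DedupTransformer::new, "seen-keys")
               .to("deduped-events");                    // hypothetical output topic for the Kafka indexing service

        new KafkaStreams(builder.build(), props).start();
    }

    // Forwards a record only the first time its key is seen; later duplicates are dropped.
    static class DedupTransformer implements Transformer<String, String, KeyValue<String, String>> {
        private KeyValueStore<String, String> seen;

        @Override
        @SuppressWarnings("unchecked")
        public void init(ProcessorContext context) {
            seen = (KeyValueStore<String, String>) context.getStateStore("seen-keys");
        }

        @Override
        public KeyValue<String, String> transform(String key, String value) {
            if (seen.get(key) != null) {
                return null;                 // duplicate key -> emit nothing
            }
            seen.put(key, value);            // remember and forward the first occurrence
            return KeyValue.pair(key, value);
        }

        @Override
        public void close() {}
    }
}
```

Note that this store grows with the number of distinct keys; in practice you’d probably bound it, e.g. with a windowed store, so it only remembers keys for as long as duplicates can realistically arrive.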