Is there a way to prevent duplicate ingestion from Kafka?

Hi all,

I have a table in Hive that I would like to ingest into Druid, and I am using Kafka ingestion for that.

It works well, but the problem is that rows are duplicated.

In fact, on the Kafka input side I cannot deduplicate the records, so some records are duplicated in Kafka itself!

Can Druid handle this and ingest only unique rows, based on a column for example?


Druid will load exactly what is in Kafka (see the description of how it works). This is true even if Kafka contains duplicates!
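Since Druid ingests the stream as-is, one option is to deduplicate upstream, before the records reach Kafka (or before your producer writes them). Here is a minimal first-write-wins sketch keyed on a record's ID column; the class and method names are my own for illustration, not a Druid or Kafka API:

```java
import java.util.*;

public class Dedupe {
    // Keeps only the first record seen for each key; later duplicates
    // with the same key are dropped. First-write-wins semantics.
    public static <K, V> List<Map.Entry<K, V>> dedupeByKey(List<Map.Entry<K, V>> records) {
        Set<K> seen = new HashSet<>();
        List<Map.Entry<K, V>> out = new ArrayList<>();
        for (Map.Entry<K, V> r : records) {
            if (seen.add(r.getKey())) {  // add() returns false for a repeat key
                out.add(r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> records = List.of(
            Map.entry("id-1", "row A"),
            Map.entry("id-2", "row B"),
            Map.entry("id-1", "row A again")  // duplicate key, will be dropped
        );
        System.out.println(Dedupe.dedupeByKey(records).size());  // prints 2
    }
}
```

For a real pipeline you would hold the seen-keys state somewhere durable (or bounded, e.g. a TTL cache), since an in-memory set grows without limit.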

IMO the solution is Kafka's idempotent producer feature, described here: . Druid doesn't support it yet, but we will as soon as this patch is done:
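For context, the idempotent producer guards against duplicates introduced by producer retries (the broker deduplicates re-sent batches); it does not remove records the application itself sends twice. A minimal sketch of the relevant producer properties, using the standard Kafka config keys with a placeholder bootstrap address:

```java
import java.util.Properties;

public class IdempotentProducerConfig {
    // Builds Kafka producer properties with idempotence enabled.
    // With enable.idempotence=true the broker discards duplicate batches
    // that the producer re-sends after a transient error, so retries no
    // longer create duplicate records in the topic.
    public static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);  // placeholder address
        props.put("enable.idempotence", "true");
        props.put("acks", "all");  // required when idempotence is enabled
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("localhost:9092");
        System.out.println(props.getProperty("enable.idempotence"));  // prints true
    }
}
```

These properties would be passed to a `KafkaProducer` as usual; nothing changes on the consumer/ingestion side.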