Is there a way to prevent duplicate ingestion from Kafka?

Hi all,

I have a table in Hive that I would like to ingest into Druid, and I am using Kafka ingestion for that.

It is working well, but the problem is that rows end up duplicated.

In fact, I cannot deduplicate the records on the Kafka input side, so some records are duplicated in Kafka itself!

Given that, can Druid handle this and ingest only unique rows, based on a column for example?

Thanks!

Druid will load exactly what is in Kafka (see https://imply.io/post/exactly-once-streaming-ingestion for a description of how it works). This is true even if Kafka has duplicates!

IMO the solution is Kafka’s idempotent producer feature, described here: https://kafka.apache.org/documentation/#upgrade_11_exactly_once_semantics. Druid doesn’t support it yet, but we will as soon as this patch is done: https://github.com/apache/incubator-druid/pull/6496
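For reference, here is a minimal sketch of what enabling idempotence looks like with the Java Kafka producer client; the broker address and topic name are placeholders, not anything from your setup:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class IdempotentProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // With idempotence enabled, the broker deduplicates retried batches
        // from this producer, so a resend after a transient failure does not
        // create duplicate records in the topic.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        // Idempotence requires acks=all.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "row-key", "row-value")); // placeholder topic
        }
    }
}

One caveat: the idempotent producer only prevents duplicates caused by producer retries. If your upstream pipeline sends the same row twice as two separate messages, this won't catch it.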