I have a table in Hive that I would like to ingest into Druid, and I'm using Kafka ingestion for that.
It is working well, but the problem is that rows are duplicated.
On the Kafka input side I cannot deduplicate the records, so some records are duplicated in Kafka itself.
Can Druid handle this and ingest only unique rows, based on a column for example?
Druid will load exactly what is in Kafka (see https://imply.io/post/exactly-once-streaming-ingestion for a description of how its exactly-once ingestion works). This is true even if Kafka itself contains duplicates!
IMO the solution is Kafka's idempotent producer feature, described here: https://kafka.apache.org/documentation/#upgrade_11_exactly_once_semantics. Druid doesn't support it yet, but we will as soon as this patch is merged: https://github.com/apache/incubator-druid/pull/6496
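For reference, enabling the idempotent producer is a producer-side setting, so it prevents duplicates from being written to Kafka in the first place (it does not deduplicate records that are already in the topic). A minimal sketch of the relevant producer properties looks like this:

```properties
# Enable the idempotent producer (Kafka 0.11+): the broker deduplicates
# retried sends using a producer ID and per-partition sequence numbers.
enable.idempotence=true
# Idempotence requires acks=all so every replica confirms the write.
acks=all
# Retries are safe with idempotence enabled; they no longer cause duplicates.
retries=2147483647
```

Note that this only covers duplicates caused by producer retries. If your upstream pipeline can emit the same logical record twice (e.g. the Hive export job runs twice), you would still need to deduplicate before or after ingestion.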