Druid - Kafka: Duplicate message issue


I have a Druid setup running version 0.18.1.
I wanted to know how Druid handles duplicate messages.

For example:

We started posting event data to a Kafka topic on 1st Sep, and we also have old data in a TSV file that contains data up to 13th Sep.

Can you please help me understand how I can create a datasource that ingests real-time data from Kafka along with all the old data from the TSV file, while storing only one copy of the event data from 1st Sep to 13th Sep?

Thanks & Regards

Amit Srivastava

Hi Amit,
Druid will just treat the duplicate records as separate records and will not de-duplicate them. I’m not sure how many records you’re dealing with, but one approach would be to query and filter the TSV to exclude 1 Sep to 13 Sep in something like Apache Hive, then ingest the filtered file into Druid. After that, turn on the Kafka feed and do real-time ingestion from there.
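If the TSV is small enough, you don’t even need Hive for the filtering step. Here is a minimal Python sketch of the same idea: drop the rows whose timestamp falls in the window the Kafka feed already covers. The column index, timestamp format, and date boundaries below are assumptions; adjust them to match your actual file.

```python
# Sketch: filter a TSV so rows in the Kafka overlap window are dropped.
# Assumes the event timestamp is in the first column, in ISO-8601 format.
import csv
from datetime import datetime

# The Kafka feed covers 1 Sep onward, so exclude that window from the TSV.
# (Year is assumed; adjust to your data.)
OVERLAP_START = datetime(2020, 9, 1)
OVERLAP_END = datetime(2020, 9, 14)  # exclusive upper bound, i.e. through 13 Sep

def filter_rows(rows):
    """Yield only the rows whose timestamp lies outside the overlap window."""
    for row in rows:
        ts = datetime.fromisoformat(row[0])
        if not (OVERLAP_START <= ts < OVERLAP_END):
            yield row

def filter_tsv(src_path, dst_path):
    """Read src_path, write dst_path with the overlapping rows removed."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t")
        writer.writerows(filter_rows(reader))
```

You would then batch-ingest the filtered file into Druid and let Kafka supply everything from 1 Sep onward, so each event is stored exactly once.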
Does that answer your question?