Rejection of Duplicate Data

Hi Team, is there any setting through which I can reject duplicate events? (I need Druid to reject an event if it has already been inserted before.)

As of now we only have the roll-up setting, and neither true nor false provides a way to reject a duplicate event.

If you keep the flag rollup: true (the default) in the specification, Druid rolls up events that have the same timestamp, down to the millisecond.

If you keep the flag rollup: false in the specification, Druid keeps multiple entries with the same timestamp when you send the same event twice.
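
To make the difference concrete, here is a toy model of the two settings in plain Python. It is not Druid code and calls no Druid API. The function name ingest, the dimension names user and page, and the sample event are all made up for illustration. The only assumption it encodes is the behaviour described above: with rollup enabled, rows sharing a timestamp and dimension values are merged, and with rollup disabled every row is stored.

```python
from collections import defaultdict

def ingest(events, rollup):
    """Toy model of the two rollup settings; this is NOT Druid code."""
    if not rollup:
        # rollup: false -> every incoming row is stored, duplicates included
        return [dict(e, count=1) for e in events]
    # rollup: true -> rows sharing timestamp and dimension values are merged
    merged = defaultdict(int)
    for e in events:
        key = (e["timestamp"], e["user"], e["page"])
        merged[key] += 1
    return [
        {"timestamp": ts, "user": user, "page": page, "count": n}
        for (ts, user, page), n in merged.items()
    ]

# Hypothetical event, sent twice.
event = {"timestamp": "2017-06-01T10:00:00.000Z", "user": "alice", "page": "home"}
rolled_up = ingest([event, event], rollup=True)
raw = ingest([event, event], rollup=False)

print(len(rolled_up), rolled_up[0]["count"])  # 1 2
print(len(raw))                               # 2
```

In both cases the second copy of the event is accepted: it is either folded into the existing row or stored as a second row. Neither setting rejects it.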

Regards,

Arpan Khagram

Dedup has to be done before Druid. Are you ingesting from Kafka or in batch?

Hi, I am ingesting from Kafka through the Kafka indexing service.

As this document explains, the Kafka indexing service guarantees exactly-once semantics.

http://druid.io/docs/latest/development/extensions-core/kafka-ingestion.html

Hi, I don't think this has to do with exactly-once insertion through Kafka.

What if I try to insert the same data row again through Kafka? Is there any way for Druid to reject a row that has the same timestamp and dimension values as an existing one?

I don’t want to roll up or keep multiple rows with the same data.

Regards,

Arpan Khagram

As Kenji pointed out, deduplication has to be done before Druid, in your ETL workflows.
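
A minimal sketch of what such an upstream deduplication step could look like, assuming an event counts as a duplicate when its timestamp and all dimension values match an earlier event. The names Deduplicator and deduplicate, the dimensions user and page, and the sample events are hypothetical. The seen-set lives in memory only, so a real pipeline would likely need a persistent or time-bounded store, and producing the surviving events to Kafka is left out.

```python
import hashlib
import json

class Deduplicator:
    """Hypothetical helper: remembers which (timestamp, dimensions) were seen."""

    def __init__(self, dimensions):
        self.dimensions = dimensions
        # Assumption: in-memory set; it grows without bound and is lost on restart.
        self.seen = set()

    def _key(self, event):
        # Identity of an event = its timestamp plus every dimension value.
        identity = [event["timestamp"]] + [event.get(d) for d in self.dimensions]
        return hashlib.sha256(json.dumps(identity).encode("utf-8")).hexdigest()

    def is_new(self, event):
        key = self._key(event)
        if key in self.seen:
            return False  # duplicate: drop it before it reaches Kafka
        self.seen.add(key)
        return True

def deduplicate(events, dimensions):
    """Return the events in order, keeping only the first copy of each."""
    dedup = Deduplicator(dimensions)
    return [e for e in events if dedup.is_new(e)]

# Made-up sample data: the second event repeats the first.
events = [
    {"timestamp": "2017-06-01T10:00:00.000Z", "user": "alice", "page": "home"},
    {"timestamp": "2017-06-01T10:00:00.000Z", "user": "alice", "page": "home"},
    {"timestamp": "2017-06-01T10:00:01.000Z", "user": "alice", "page": "home"},
]
unique_events = deduplicate(events, ["user", "page"])
print(len(unique_events))  # 2
```

Only the events for which is_new returns True would then be sent on to the Kafka topic that Druid reads from.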

Hi Nishant, thanks for replying. I understand that the feature to reject a duplicate event is not there as of now, but don't you think it is necessary to have it (at least at the time of creating a segment, if not at runtime)?

Regards,
Arpan