Duplicate data handling during batch ingestion

Are there any ways to avoid ingesting duplicate data in Druid? I couldn't find anything about this in the documentation.
Also, what metric spec should be used to count the unique values of a particular dimension?
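
For context, this is roughly the shape of what I am after — a metricsSpec along these lines, assuming Druid's hyperUnique aggregator is the right tool for approximate unique counts (the user_id dimension and the metric names below are just placeholders):

    "metricsSpec": [
      { "type": "count", "name": "count" },
      { "type": "hyperUnique", "name": "unique_users", "fieldName": "user_id" }
    ]

If I understand correctly, a metric ingested this way can then be queried with the hyperUnique query-time aggregator.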

Hi Manish, if you use batch ingestion, the result should be 100% accurate with respect to the data you put in. If you are looking for exactly-once streaming ingestion, we are working towards this for Kafka->Druid, and you should follow this PR:

Hi Fangjin,

Hi Manish, can you do the de-duplication as part of your ETL layer? Druid doesn't have anything built in natively to handle de-duplication. When you reindex data in Druid, it creates new versions of segments that obsolete the older versions for the same interval of time.
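
To make the reindexing point concrete, here is a trimmed sketch of an overwrite: re-run a batch index task over the same interval, pointing at the de-duplicated output of your ETL job. The datasource name, paths, and interval below are made up, and the dataSchema is cut down to the relevant parts — a real spec also needs a parser and a metricsSpec:

    {
      "type": "index",
      "spec": {
        "dataSchema": {
          "dataSource": "events",
          "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "DAY",
            "intervals": ["2016-01-01/2016-01-02"]
          }
        },
        "ioConfig": {
          "type": "index",
          "firehose": {
            "type": "local",
            "baseDir": "/tmp/deduped",
            "filter": "events-*.json"
          }
        }
      }
    }

Once the task publishes, queries over that interval are served by the new segment version and the older version is dropped from the cluster, so re-running a corrected batch job over an interval acts as an idempotent overwrite.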