I’m seeing some duplicated data in our production system and have a question:
If a Hadoop batch task fails (some of our tasks have failed during the reduce phase), is it possible that partial data was still introduced into Druid?
No, the segment metadata for a batch index job is committed to the metadata storage in a single transaction after the MR job completes, so a failed job leaves no data behind.
Also, even if multiple batch tasks ran over the same data, the segments from the newer job will overshadow those from the previous one.
I would check the data being ingested into Druid to see whether it already contains duplicates.
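One quick way to check the input data before it reaches Druid is to scan the ingestion file for rows that occur more than once. Here is a minimal sketch (the sample rows and the `find_duplicate_rows` helper are hypothetical, just for illustration):

```python
from collections import Counter


def find_duplicate_rows(rows):
    """Return the rows that appear more than once, mapped to their counts."""
    counts = Counter(rows)
    return {row: n for row, n in counts.items() if n > 1}


# Hypothetical sample of lines as they might appear in an ingestion file.
sample = [
    '{"ts": "2016-01-01T00:00:00Z", "user": "alice", "clicks": 3}',
    '{"ts": "2016-01-01T00:00:00Z", "user": "alice", "clicks": 3}',
    '{"ts": "2016-01-01T01:00:00Z", "user": "bob", "clicks": 1}',
]

print(find_duplicate_rows(sample))
# The first row appears twice, so it is reported with a count of 2.
```

If duplicates show up here, the problem is upstream of Druid (e.g., in the ETL pipeline), not in the ingestion job itself.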
Hi Nishant, thanks. It turned out to be our ETL process that was adding the duplicate rows.