Ingestion creating duplicates

Hi,
I am using druid 24.0.0 and ingesting data using multi-stage query.

I am facing an issue where the ingestion creating same record twice as shown below:

I cross verified with my source and it has only single record.
Not sure what is causing the duplicates. I tried deleting segments from S3 and from druid console and did a fresh ingestion but then also facing the same issue.
Please let me know what is going wrong.
Also note that i am NOT setting appendToExisting

please paste the query you are using to ingest

Please find below the query:

REPLACE INTO fact_extrapolated_revenue OVERWRITE ALL  WITH fact_extrapolated_revenue_table AS
 (SELECT * FROM TABLE(   EXTERN(     '{\"type\":\"s3\",\"prefixes\":[\"s3://data-dev/druid_poc_1/fact_extrapolated_revenue\"]}',     '{\"type\":\"csv\",\"findColumnsFromHeader\":true}',    '[{\"name\":\"product_id\",\"type\":\"string\"},
 {\"name\":\"facilitytype_id\",\"type\":\"long\"},
 {\"name\":\"month_end_datekey\",\"type\":\"string\"}
 ,{\"name\":\"dist_total_units\",\"type\":\"long\"},
 {\"name\":\"dist_total_revenue\",\"type\":\"long\"},
 {\"name\":\"modified\",\"type\":\"string\"}]'   ) ))
 SELECT   coalesce(TIME_PARSE(fez.month_end_datekey,'yyyyMMdd'),current_timestamp) AS __time,   EXTRACT(MONTH from coalesce(TIME_PARSE(fez.month_end_datekey,'yyyyMMdd'),current_timestamp)) AS sale_month,   EXTRACT(YEAR from coalesce(TIME_PARSE(fez.month_end_datekey,'yyyyMMdd'),current_timestamp)) AS sale_year,   EXTRACT(QUARTER from coalesce(TIME_PARSE(fez.month_end_datekey,'yyyyMMdd'),current_timestamp)) AS sale_quarter,   fez.product_id,   fez.facilitytype_id,    fez.dist_total_units,   fez.dist_total_revenue   FROM fact_extrapolated_revenue_table fez   PARTITIONED BY YEAR

Do you have any other folder with similar name? Basically data will ingested from fact_extrapolated_revenue*

It might be worth doing a druid batch ingestion with the same prefix and checking the result.

Hi @Vijay_Narayanan1 these are the various folders:
image

Another point to note is that when i ingest the data as parquet file, i am getting the correct result.

are you trying to ingest all four folders?

can you rename fact_extrapolated_revenue to fact_extrapolated_revenue_csv and then use that in your ingestion query? Note fact_extrapolated_revenue_parquet will pick up pnly that folder but fact_extrapolated_revenue will pick up all four

1 Like

Thank you @Vijay_Narayanan1