In Druid, one of our datasets has 2,400 segments, and we have 30 such datasets. Every day, some of the records belonging to these 2,400 segments get updated. The fraction of updated records is very low (< 0.1%), but the updates span every segment. Because of this, we end up backfilling all datasets daily.
We run daily batch ingestion using the index_parallel task type. The data we load into Druid comes from Snowflake (where we keep enterprise-wide data), and records as far back as 20 years get updated every day. The updated records amount to less than 1% of the table, but that 1% spans all segments in Druid, so we end up backfilling the entire dataset daily.
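For context, a stripped-down sketch of the kind of index_parallel spec we submit daily is below. The datasource name, columns, and input source are placeholders (how we stage the Snowflake export is not the point here). What matters is that the intervals cover the full ~20-year range, so every run rewrites every segment:

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "enterprise_dataset_1",
      "timestampSpec": { "column": "event_date", "format": "auto" },
      "dimensionsSpec": { "dimensions": ["record_id", "region", "status"] },
      "granularitySpec": {
        "segmentGranularity": "month",
        "queryGranularity": "none",
        "intervals": ["2004-01-01/2024-01-01"]
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": { "type": "s3", "prefixes": ["s3://bucket/snowflake-export/dataset_1/"] },
      "inputFormat": { "type": "parquet" },
      "appendToExisting": false
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxNumConcurrentSubTasks": 4
    }
  }
}
```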
Because of this, the cost of the Druid cluster is shooting up, driven by the large number of MiddleManager nodes needed for the daily backfills.
If there were a way to update only the records that changed (something like SQL MERGE functionality), that would be very beneficial.
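To illustrate what we mean, this is the kind of statement we would ideally run (purely hypothetical, since Druid does not support MERGE today; table and column names are made up):

```sql
-- Hypothetical: apply only the daily changed rows from Snowflake,
-- instead of re-ingesting all segments of the datasource.
MERGE INTO druid_dataset AS target
USING daily_changes AS source
  ON  target.record_id  = source.record_id
  AND target.event_date = source.event_date  -- would let the engine touch only affected segments
WHEN MATCHED THEN UPDATE SET
  target.amount = source.amount,
  target.status = source.status
WHEN NOT MATCHED THEN INSERT
  (record_id, event_date, amount, status)
  VALUES (source.record_id, source.event_date, source.amount, source.status);
```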