[druid-user] Druid Row level update - Is there any possibility?


In Druid, one dataset has 2400 segments, and we have 30 such datasets. Every day, some of the records in these 2400 segments get updated. The number of updated records is very low (< 0.1%), but they are spread across every segment. Because of this, we end up backfilling all datasets on a daily basis.

We run batch ingestion with the index_parallel task type every day. The data we load into Druid comes from Snowflake (where we keep enterprise-wide data), and the daily updates can touch records going back as far as 20 years. The updated records total less than 1% of the records in the table, but that 1% is spread across all segments in Druid, so we backfill the entire dataset every day.

Because of this use case, the cost of the Druid cluster is shooting up due to the large number of MiddleManager nodes.


If there is a way to update only the records that changed (maybe something like a SQL MERGE), that would be beneficial.

You can load only the changed data and append it to the existing data (appendToExisting needs to be set to true in the ingestion spec). In the query, you can then use LATEST/EARLIEST in Druid SQL to get the latest data, for instance per transaction id.
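
To make the suggestion concrete, here is a rough sketch in Python of what the append-only ingestion spec and the read-side query might look like. It is not a tested setup for your cluster. The datasource name "orders", the columns "txn_id", "amount", "event_time" and "updated_at", the JSON input format and the S3 URI are all made-up placeholders. Two things to keep in mind: appended segments must use dynamic partitioning, and plain LATEST orders rows by __time, so if a corrected row keeps the same __time as the original you would probably want LATEST_BY with a separate update-timestamp column, which is what the sketch assumes. The small latest_by helper only emulates in plain Python what the query is expected to return.

```python
import json


def build_append_spec(datasource: str, changed_rows_uri: str) -> dict:
    """Build an index_parallel spec that appends only the changed rows.

    Column names, input format and URI are hypothetical placeholders.
    """
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "timestampSpec": {"column": "event_time", "format": "iso"},
                "dimensionsSpec": {
                    "dimensions": [
                        "txn_id",
                        {"type": "double", "name": "amount"},
                        # Assumed column: epoch millis of when the row changed upstream.
                        {"type": "long", "name": "updated_at"},
                    ]
                },
                "granularitySpec": {
                    "segmentGranularity": "day",
                    "queryGranularity": "none",
                    "rollup": False,  # keep raw rows so each version stays visible
                },
            },
            "ioConfig": {
                "type": "index_parallel",
                "inputSource": {"type": "s3", "uris": [changed_rows_uri]},
                "inputFormat": {"type": "json"},
                # Add new segments next to the existing ones instead of replacing them.
                "appendToExisting": True,
            },
            "tuningConfig": {
                "type": "index_parallel",
                # Appending requires dynamic partitioning.
                "partitionsSpec": {"type": "dynamic"},
            },
        },
    }


# Read side: pick the most recently updated value per transaction id.
LATEST_QUERY = """
SELECT txn_id,
       LATEST_BY(amount, MILLIS_TO_TIMESTAMP(updated_at)) AS amount
FROM "orders"
GROUP BY txn_id
"""


def latest_by(rows, key, value, ts):
    """Plain-Python emulation of the GROUP BY + LATEST_BY query above."""
    best = {}
    for row in rows:
        k = row[key]
        if k not in best or row[ts] > best[k][ts]:
            best[k] = row
    return {k: r[value] for k, r in best.items()}


if __name__ == "__main__":
    spec = build_append_spec("orders", "s3://example-bucket/changed/rows.json")
    print(json.dumps(spec, indent=2))
```

Note that with this approach the old and the updated versions of a row both stay in the datasource, so every query that needs current values has to go through an aggregation like the one above.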