Re: [druid-user] Versioned data - using latest record

Hi Saurabh -

Not too basic, and you’re right, it’s different from RDBMS SQL. You can’t do updates on an existing record. You can overwrite them, but then everything is grouped by time intervals, and you’d have to overwrite everything for that time interval. (The storage is in immutable segments, which are arranged by time intervals, and can be entirely - or not at all - overwritten.)

With native druid queries, you can use the stringLast aggregator to get the latest value. With druid SQL, I guess you could probably do some subquery logic, but it might get a little messy or performance might suffer. Eg,

select * from myDataSource
where someKey = someVal
and “__time” =
(select max ("__time")
from myDataSource where someKey = someVal)

(I’m guessing there’s a better way - that just comes to the top of my head.)

Hey! Also note that you have the LATEST function (SQL · Apache Druid).

Please do also be aware that as Druid is partly a time-series database, the sense of “row-level updates” is not the same as you would be familiar with in a traditional database: updates to Druid data are performed for all data over an interval of time versus your traditional UPDATE command.

This is the query I came up with for this use case, where reprocessed_ts indicates when the record was ‘last loaded into druid’, so if its newer version of a previous message the inner query will select those.

select count(*) from dupe_test
where
company_name = ‘AcmeInc’
and __time >= ‘2021-03-11 01:00:00’ and __time < ‘2021-03-11 02:00:00’
and reprocessed_ts in
(
select latest(reprocessed_ts)
from dupe_test
where company_name = ‘AcmeInc’
and __time >= ‘2021-03-11 01:00:00’ and __time < ‘2021-03-11 02:00:00’
group by message_id
)

however it feels like I may need to have the inner query concat message_id+reprocessed_ts, to ensure it matches properly on the outer query.
This feels like it should perform better than using a temp table…

FYI I’ve seen IN take more than one parameter - like where (x1, y1) IN (SELECT x2 y2 FROM bananas) - maybe you could use that for your compound?