Maintain only the latest data in a Druid datasource

Hi
I’ve created a datasource in Druid based on a Kafka topic, and it is currently ingesting data.

I have set the "Append to existing" option to false. I’ve also opted for no rollup, as I need transaction-level data. However, as far as I can see, it is storing a duplicate row for each change rather than overwriting the existing row with the latest values.

Can you please tell me how I can configure it to store only the latest data instead of the entire history of changes?

Thanks in advance

Visakh

Hi Visakh,

Do you mean there are multiple segments created for the same interval in the historical nodes? Can you provide more information on the issue you are referring to?

If appendToExisting in the ioConfig (a native batch ingestion setting) is set to false, then existing segments for the same interval are replaced by the newly created segments. You can verify this in the Druid console under Segments by checking the number of partitions created for a given interval.

https://druid.apache.org/docs/latest/ingestion/native-batch.html#ioconfig
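
For reference, here is a minimal sketch of where that flag sits in a native batch ioConfig, written as a Python dict that serializes to the JSON spec. The inputSource values (the base directory and file filter) are placeholders assumed purely for illustration, not taken from this thread.

```python
import json

# Sketch of a native batch ioConfig fragment.
# "baseDir" and "filter" are hypothetical placeholder values.
io_config = {
    "type": "index_parallel",
    "inputSource": {
        "type": "local",
        "baseDir": "/path/to/data",  # placeholder
        "filter": "*.json",          # placeholder
    },
    "inputFormat": {"type": "json"},
    # False: new segments replace existing ones for the same interval
    # instead of being added alongside them.
    "appendToExisting": False,
}

print(json.dumps(io_config, indent=2))
```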

Thanks,

Hemanth

Hi Hemanth
Thanks for the quick reply.

No, the segments themselves don’t overlap in their intervals.
As far as I can see, updates to the same row happen over a period of time that spans multiple segment intervals. Hence I guess each update gets picked up by a different segment and ends up as a separate row in the Druid datasource.

My question is whether it’s possible to maintain a single row in its latest state based on key columns. I can’t use rollup here, as I need transaction-level detail with the exact timestamp as well.

Regards

Visakh

Hi Visakh,

From the RDBMS point of view, it is the same row being updated. But in my opinion, since Druid is reading it through your Kafka topic, whatever arrives on the topic as a new message will be read as a new record in Druid.

Druid basically doesn’t check whether it is the same row being updated. However, if you are interested in the latest updated row for a given key column, you can get it at query time by filtering on the latest timestamp for each key.
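
To make the query-time approach concrete, here is a rough sketch under stated assumptions. `latest_per_key` shows the selection logic on in-memory rows, and `build_latest_query` builds a SQL string that joins the datasource to its own per-key MAX(__time). The datasource name "transactions" and the key column "txn_id" are hypothetical, both function names are made up for this example, and the SQL has not been run against a real cluster, so treat it as a starting point rather than a verified query.

```python
from typing import Any, Dict, Iterable, List, Sequence, Tuple

Row = Dict[str, Any]


def latest_per_key(rows: Iterable[Row], key_cols: Sequence[str],
                   ts_col: str = "__time") -> List[Row]:
    """Keep only the row with the greatest timestamp for each key."""
    latest: Dict[Tuple[Any, ...], Row] = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        current = latest.get(key)
        # On equal timestamps, the row seen later wins.
        if current is None or row[ts_col] >= current[ts_col]:
            latest[key] = row
    return list(latest.values())


def build_latest_query(datasource: str, key_col: str) -> str:
    """Build a SQL sketch: join each row to the max __time of its key."""
    return (
        f'SELECT t.*\n'
        f'FROM "{datasource}" t\n'
        f'INNER JOIN (\n'
        f'  SELECT "{key_col}", MAX(__time) AS max_time\n'
        f'  FROM "{datasource}"\n'
        f'  GROUP BY "{key_col}"\n'
        f') m ON t."{key_col}" = m."{key_col}" AND t.__time = m.max_time'
    )


# Hypothetical sample: two updates to transaction A, one row for B.
sample_rows = [
    {"__time": 1000, "txn_id": "A", "status": "created"},
    {"__time": 2000, "txn_id": "A", "status": "paid"},
    {"__time": 1500, "txn_id": "B", "status": "created"},
]

print(latest_per_key(sample_rows, ["txn_id"]))
print(build_latest_query("transactions", "txn_id"))
```

Note that this leaves the full history in the datasource; only the query result is reduced to one row per key.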

There may be a better solution to this, so let’s wait and see whether someone in the community has a better way to handle it in Druid.

Thanks and Regards,

Vaibhav

Hi Vaibhav
Thanks for the response.

Yes, that makes sense.

I was expecting there would be some feature within Druid to keep the data updated in a single row based on the key.

Sure, I’ll wait to see if anyone has a cool way to deal with this.

Thanks once again

Visakh