Druid GCS deep-storage space usage

Hi all,

We are using GCS as a deep-storage, with the google extension.

The Druid cluster contains around 80GB of data (non-replicated), yet the deep-storage GCS bucket takes up 1.5TB of space. We often replace the last few months' worth of data. Is this heavy GCS usage expected? Can I reduce it somehow?

Storing 1.5TB is not an issue in itself, but I also need to back up the deep storage and would like to keep a daily copy for the last couple of weeks, which will quickly inflate the backup storage size.




I ran into the same behaviour and concluded that it is due to how Druid stores its segments.

Druid never deletes data in place; it writes new segments and updates the reference so the new version becomes the current data.

So every time you reload your datasource, you write new segments and therefore increase GCS usage.
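To illustrate, each reload creates a new version directory in deep storage while the old one stays behind. The bucket name, prefix, and datasource below are placeholders; the general layout is datasource / interval / version timestamp / partition:

```
gs://my-bucket/druid/segments/my_datasource/
  2020-01-01T00:00:00.000Z_2020-02-01T00:00:00.000Z/
    2020-03-10T08:00:00.000Z/0/index.zip   <- old version, still billed
    2020-04-02T09:30:00.000Z/0/index.zip   <- current version
```

Only the latest version per interval is referenced by the cluster; the older directories are what make the bucket grow far beyond the 80GB the cluster actually serves.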

To remove old segment versions, you can run kill tasks, which permanently delete segments that have been marked as unused.
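For reference, a kill task is a small JSON spec submitted to the Overlord (POST to `/druid/indexer/v1/task`); the datasource name and interval here are placeholders to adapt:

```json
{
  "type": "kill",
  "dataSource": "my_datasource",
  "interval": "2020-01-01/2020-06-01"
}
```

Alternatively, the Coordinator can issue kill tasks automatically for unused segments by setting `druid.coordinator.kill.on=true` (along with the related `druid.coordinator.kill.period` and `druid.coordinator.kill.durationToRetain` properties), which keeps deep-storage usage bounded without manual cleanup.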


Thanks very much, Guillaume! Exactly what I was looking for.