Setting when Druid writes to deep storage (e.g. S3)


Is there a config that can be adjusted when Druid writes ingested data to deep storage? We are reaching capacity in our historical nodes and would like to increase the frequency of writes to S3.

Assuming you are talking about realtime ingestion, you could reduce the task duration so that it writes more often. taskDuration is often set to PT3600S (one hour); you could reduce this to, say, PT1800S and have it publish the segment every half an hour instead.
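As a rough sketch of that change, the snippet below edits taskDuration in a supervisor spec held as a Python dict. The spec is heavily trimmed, and the datasource and topic names are made up; I am also assuming the layout where ioConfig sits under a top-level "spec" key, so adjust the path if your spec is laid out differently.

```python
import json


def with_task_duration(supervisor_spec, duration):
    """Return a copy of a supervisor spec with ioConfig.taskDuration replaced."""
    # Deep copy via a JSON round trip so the original spec is left untouched.
    updated = json.loads(json.dumps(supervisor_spec))
    updated["spec"]["ioConfig"]["taskDuration"] = duration
    return updated


# Hypothetical, heavily trimmed Kafka supervisor spec (names are placeholders).
spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {"dataSource": "example_datasource"},
        "ioConfig": {"topic": "example_topic", "taskDuration": "PT3600S"},
    },
}

half_hourly = with_task_duration(spec, "PT1800S")
print(json.dumps(half_hourly["spec"]["ioConfig"], indent=2))
```

The updated spec would then be resubmitted to the supervisor for the change to take effect.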

When you say capacity, do you mean storage on the historicals? You may want to set retention rules so that only the most frequently used intervals are retained on the historicals.

Hi Vijay - where can I find this setting?

Here’s a handy link: Tutorial: Configuring data retention · Apache Druid

You can set the retention rules here:

Please note that if you drop these intervals, you will not be able to query them.
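For illustration, here is a minimal sketch of what a rule set could look like: keep the most recent month on the historicals and drop everything older. The one-month period, the replica count, and the tier name are assumptions for the example, not recommendations.

```python
import json


def retention_rules(keep_period, replicants=2, tier="_default_tier"):
    """Build an ordered rule list: load recent data, drop everything else."""
    return [
        {
            "type": "loadByPeriod",
            "period": keep_period,  # ISO 8601 period, e.g. "P1M"
            "includeFuture": True,
            "tieredReplicants": {tier: replicants},
        },
        # Rules are evaluated top to bottom, so this only catches segments
        # that the load rule above did not match.
        {"type": "dropForever"},
    ]


rules = retention_rules("P1M")
print(json.dumps(rules, indent=2))
```

As far as I know, a list like this can be set per datasource in the web console or posted to the Coordinator rules endpoint (/druid/coordinator/v1/rules/{dataSource}). Dropped segments stay in deep storage, they are just no longer loaded on the historicals.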

Hi Vijeth,

So the taskDuration parameter is the straight answer to my question? It publishes to deep storage based on this setting?

What is the issue? Running out of storage capacity on the historicals can’t be solved by speeding up ingestion. If you need less data stored on the historicals then retention rules are the way to go. If you want to query all this data then you need to add disk or add more historicals. Reducing the task duration will ensure that the tasks publish to deep storage faster but won’t affect the overall amount of data in the historicals.

Hi @magnusarcher, Vijay is right that if you are running out of space on the historicals, you will need to either add storage/nodes or reduce the amount of data to be stored.

For the latter the simplest thing would be to drop the older data, but that will render the dropped data un-queryable.
You could try compaction to see if you can get better compression of the data and save storage. (I’ve seen range partitioning help in this regard, but the improvements are very dependent on your data.)
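If you want to try that, an auto-compaction config with range partitioning might look roughly like the sketch below. The datasource name, the partition dimension, and the row target are placeholders, and the field names are from my reading of the docs, so please check them against your Druid version.

```python
import json


def range_compaction_config(datasource, dimensions, target_rows=5_000_000):
    """Build an auto-compaction config that repartitions by range."""
    return {
        "dataSource": datasource,
        "tuningConfig": {
            "partitionsSpec": {
                "type": "range",
                # Rows with similar values in these dimensions end up in the
                # same segment, which is what tends to help compression.
                "partitionDimensions": list(dimensions),
                "targetRowsPerSegment": target_rows,
            }
        },
    }


# Hypothetical datasource and dimension names.
config = range_compaction_config("example_datasource", ["customer_id"])
print(json.dumps(config, indent=2))
```

Which dimension to partition on depends on your queries and data; a dimension that is frequently filtered on is usually a reasonable first thing to try.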

Even if you publish the segments sooner, they will then be loaded back into the historical for querying, so you will run into the same issues you are trying to solve.

Thanks Vijay, Vijeth. It does appear we need to reduce the retention, as the historical nodes are getting evicted while loading the huge segments.

OK, this has opened another train of thought. When you say huge segment size, do you mean individual segments are large or that the overall data size is large? If individual segments exceed 1-2 GB then you should lower the max rows per segment in the ingestion spec to reduce this.
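A back-of-the-envelope way to pick a new value, assuming segment size scales roughly linearly with row count: scale the current row limit by the ratio of target size to observed size. The 3 GiB observed size and the 700 MiB target below are made-up numbers for the example, and maxRowsPerSegment in tuningConfig is the setting I am assuming you would change.

```python
def scaled_max_rows(current_rows, current_bytes, target_bytes):
    """Estimate a row limit that brings segments down to target_bytes.

    Assumes bytes per row stay roughly constant, which is only an
    approximation since compression varies with the data.
    """
    return max(1, current_rows * target_bytes // current_bytes)


GIB = 1024 ** 3
MIB = 1024 ** 2

# Hypothetical observation: 5,000,000-row segments come out at about 3 GiB.
new_max = scaled_max_rows(5_000_000, 3 * GIB, 700 * MIB)

# Fragment of an ingestion spec's tuningConfig with the new limit.
tuning_config = {"type": "kafka", "maxRowsPerSegment": new_max}
print(tuning_config)
```

Treat the result as a starting point and check the actual segment sizes after the next few task cycles.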