Outage caused by historical miscalculating remaining disk capacity - segment too large for storage

We saw this error on a Druid historical node:

Caused by: com.metamx.common.ISE: Segment[timeseries_dogstatsd_counter_2018-04-04T16:00:00.000Z_2018-04-04T17:00:00.000Z_2018-04-04T16:00:00.000Z_1528210995:152,770,889] too large for storage[/var/tmp/druid/indexCache:22,010].

Once it appears, the historical node stops loading new segments from realtime, and the realtime nodes start accumulating segments.

Our maxSize settings are as follows, and we had enough free disk space.


Restarting the Druid historical fixes the issue. We suspect that something is going wrong in how Druid calculates the available size, i.e., the 22,010 reported in the error.
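To check this the next time it happens, a small script can pull the two numbers out of the error line and compare them with what the filesystem reports. The following is only a rough sketch: the function names are made up for illustration, the regex is based solely on the one message quoted in this report, and it assumes both figures are byte counts. A mismatch is suggestive rather than conclusive, since the configured maxSize can legitimately be smaller than the disk.

```python
import re
import shutil

# Pattern derived from the single error line quoted in this report;
# other Druid versions may format the message differently.
ERROR_PATTERN = re.compile(
    r"Segment\[(?P<segment>.+):(?P<size>[\d,]+)\] too large for "
    r"storage\[(?P<path>.+):(?P<available>[\d,]+)\]"
)


def parse_too_large_error(line):
    """Extract the segment size and reported available size from the error line."""
    match = ERROR_PATTERN.search(line)
    if match is None:
        return None
    return {
        "segment": match.group("segment"),
        # Assumption: both numbers are bytes, printed with thousands separators.
        "segment_bytes": int(match.group("size").replace(",", "")),
        "path": match.group("path"),
        "reported_available_bytes": int(match.group("available").replace(",", "")),
    }


def accounting_looks_wrong(parsed, disk_free_bytes):
    """True when the disk could hold the segment but the reported figure says it cannot."""
    return (
        parsed["reported_available_bytes"]
        < parsed["segment_bytes"]
        <= disk_free_bytes
    )


def check_against_disk(line):
    """Compare the reported figure with the real free space at the cache path."""
    parsed = parse_too_large_error(line)
    if parsed is None:
        return None
    free = shutil.disk_usage(parsed["path"]).free  # raises if the path is missing
    return accounting_looks_wrong(parsed, free)
```

Running `check_against_disk` on the historical host with the logged line as input returns `True` when the disk has room for the segment while the figure in the message does not. It is also worth comparing the reported figure against maxSize minus the total size of the segments actually present in the cache directory.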

We opened an issue for this on GitHub about a month ago (https://github.com/druid-io/druid/issues/5577). It has since recurred and caused an outage because many segments on the Druid realtime nodes had not been handed off.