I’m working on a cluster whose storage requirements change over time. When it runs out of disk space, it’s easy to add a new node and let the coordinator balance the load. However, when there’s too much free disk space and a historical node needs to be decommissioned, I don’t know what the right way to do it is.
Currently, I’m just terminating one of the instances. That works well, except that some data becomes unavailable until it has been loaded onto the remaining nodes. Is there any way to tell the coordinator to stop using a historical node, so that it can be safely terminated once it is no longer in charge of any data?
I’m far from being a Druid power user, so there might be a better approach, but what you can definitely do is use replicas.
For instance, if you set a replication factor of 2 on your datasource, each segment is loaded onto two historical nodes instead of just one. This way, if one node goes down, the data can still be served, because another node that holds the same segments is still alive.
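As a rough sketch of how the replication factor could be set: as far as I know, replication is controlled through the datasource's load rules, which can be posted to the coordinator at `/druid/coordinator/v1/rules/{dataSourceName}`. The `loadForever` rule with `tieredReplicants` and the `_default_tier` tier name are what I'd expect on a default setup; the host and datasource name in the comment are made up, so adjust them and check the rule format against your Druid version.

```python
import json
import urllib.request


def build_replication_rules(replicas, tier="_default_tier"):
    """Return a rule list that keeps `replicas` copies of every segment in `tier`."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return [{"type": "loadForever", "tieredReplicants": {tier: replicas}}]


def post_rules(coordinator_url, datasource, rules):
    """POST the rule list to the coordinator.

    Note: this replaces the datasource's existing rule list.
    """
    request = urllib.request.Request(
        f"{coordinator_url}/druid/coordinator/v1/rules/{datasource}",
        data=json.dumps(rules).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


rules = build_replication_rules(2)
print(json.dumps(rules))

# To apply it (hypothetical host and datasource name):
# post_rules("http://coordinator.example.com:8081", "my_datasource", rules)
```

The same rule can also be set from the coordinator console if you prefer not to script it.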
This way you can replace all your instances in a rolling fashion without your datasource ever becoming unavailable. You just need to wait until all the segments from the decommissioned machine have been replicated again (check the coordinator to see whether there is still a load queue) before continuing with the next node.
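The waiting step could be scripted roughly like this. It's only a sketch: it assumes the coordinator's `/druid/coordinator/v1/loadqueue?simple` endpoint returns one entry per historical with a `segmentsToLoad` count, which is how I remember it, so verify the response shape on your version. The function names and the host in the comment are my own invention.

```python
import json
import time
import urllib.request


def fetch_load_queue(coordinator_url):
    """Fetch the per-historical load queue summary from the coordinator."""
    url = f"{coordinator_url}/druid/coordinator/v1/loadqueue?simple"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def pending_segments(load_queue):
    """Total number of segments still waiting to be loaded on any historical."""
    return sum(entry.get("segmentsToLoad", 0) for entry in load_queue.values())


def wait_until_replicated(coordinator_url, poll_seconds=30,
                          fetch=fetch_load_queue, sleep=time.sleep):
    """Block until no historical has segments left to load.

    Returns the number of polls it took. `fetch` and `sleep` are
    injectable so the loop can be exercised without a real cluster.
    """
    polls = 0
    while True:
        polls += 1
        if pending_segments(fetch(coordinator_url)) == 0:
            return polls
        sleep(poll_seconds)


# Usage (hypothetical host): terminate one node, then
# wait_until_replicated("http://coordinator.example.com:8081")
# before moving on to the next node.
```

One caveat: right after a node is terminated the queue may still look empty until the coordinator's next run notices the missing replicas, so it's probably wise to wait a little before the first poll, or to also check the coordinator's load status for the datasource.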
Of course, this setup generally requires twice the disk space, but it is a safer and more reliable way to keep the data you want to serve available.
That said, I’m eager to hear other solutions as well.