I was applying a common.properties patch to all nodes in my Druid cluster, one node at a time. When I applied the patch on the historical-hot nodes, disk usage across the nodes jumped from 78% to almost 98%. For each node I apply the patch, then use Ansible to restart and reload the service, waiting about 30 seconds before moving on to the next node. What can I do to avoid the usage spike?
not sure what you mean by “data usage”. Are you referring to the amount of disk space used in a tier as shown in the coordinator console?
I’m not a Druid veteran, but perhaps the following might be helpful if not already known to you:
- historical nodes have a /druid/historical/v1/loadstatus endpoint: “Returns a flag indicating if all segments in the local cache have been loaded. This can be used to know when a historical node is ready to be queried after a restart.”
You could restart one historical, then wait for the endpoint to report that the node is fully up and then restart the next.
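A minimal sketch of that wait loop, using only the standard library. The endpoint path is from the Druid docs quoted above; the host/port, timeout, and poll interval are assumptions you would adapt to your cluster:

```python
# Sketch: after restarting a historical, poll its loadstatus endpoint until
# the local segment cache is fully loaded, then move on to the next node.
import json
import time
import urllib.request

def cache_initialized(body: bytes) -> bool:
    """Parse a /druid/historical/v1/loadstatus response body."""
    return bool(json.loads(body).get("cacheInitialized", False))

def wait_until_loaded(host: str, port: int = 8083,
                      timeout_s: int = 600, poll_s: int = 10) -> bool:
    """Return True once the historical reports its cache as loaded."""
    url = f"http://{host}:{port}/druid/historical/v1/loadstatus"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if cache_initialized(resp.read()):
                    return True
        except OSError:
            pass  # node may still be starting up; keep polling
        time.sleep(poll_s)
    return False

# Usage sketch (hypothetical hostnames):
# for host in ["historical-hot-1", "historical-hot-2"]:
#     restart_druid(host)          # your Ansible step
#     assert wait_until_loaded(host), f"{host} did not come back in time"
```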
If you restart nodes too quickly, you might drop below your configured replication level and the coordinator might become active too quickly and rebalance data. The same goes if a historical is down for too long or doesn’t come back up quickly enough: the coordinator might decide the node has died and assign the segments that node was serving to other historicals. Configuring the coordinator so that it isn’t doing these checks too frequently is one option.
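For the "less frequent checks" option, a sketch of a coordinator runtime.properties fragment; the property names are standard coordinator settings, but the values here are illustrative and should be tuned to how long your restarts actually take:

```properties
# How often the coordinator runs its duties (rebalancing, load rules, etc.).
# Default is PT60S; a longer period gives a restarting node more slack.
druid.coordinator.period=PT120S

# Grace period after the coordinator itself starts before it begins acting.
druid.coordinator.startDelay=PT300S
```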
you could set the coordinator’s dynamic properties for segment rebalancing and replication to 0 before you perform the update, and set those properties back to their original values after the update. The coordinator web console lets you change these properties dynamically.
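The same change can be scripted against the coordinator’s dynamic configuration endpoint (`GET`/`POST /druid/coordinator/v1/config`). A sketch, assuming a hypothetical coordinator address and assuming `maxSegmentsToMove` / `replicationThrottleLimit` are the two knobs you want to zero out (check your Druid version’s dynamic-config docs for the exact property names):

```python
# Sketch: save the coordinator's dynamic config, override the rebalancing
# and replication knobs to 0 for the rolling update, then restore it.
import json
import urllib.request

COORDINATOR = "http://coordinator.example:8081"  # hypothetical address

def with_overrides(current: dict, overrides: dict) -> dict:
    """Merge overrides into the current dynamic config without mutating it."""
    merged = dict(current)
    merged.update(overrides)
    return merged

def get_dynamic_config() -> dict:
    with urllib.request.urlopen(f"{COORDINATOR}/druid/coordinator/v1/config") as r:
        return json.loads(r.read())

def post_dynamic_config(cfg: dict) -> None:
    req = urllib.request.Request(
        f"{COORDINATOR}/druid/coordinator/v1/config",
        data=json.dumps(cfg).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)

# Usage sketch:
# original = get_dynamic_config()
# post_dynamic_config(with_overrides(original, {
#     "maxSegmentsToMove": 0,         # stop segment rebalancing
#     "replicationThrottleLimit": 0,  # stop replica loading
# }))
# ... perform the rolling restart of the historicals ...
# post_dynamic_config(original)       # restore the saved values
```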
alternatively, you could simply shut down the coordinator service before the update and bring it back up afterwards. With the coordinator down, no rebalancing or segment reassignment can happen in your cluster.