Added new historical nodes. Segment load balancing not happening

Hi Team,

We had 10 historical nodes with roughly 26 TB of data earlier. We added 10 more historical nodes a few days back to provide more querying power, but we are not seeing segments balanced evenly across all 20 historical nodes.

Please see below. The segments are not evenly distributed.
1.png
2.png
Below is the coordinator dynamic configuration:
3.png

Please let me know if anything is wrong with the above coordinator cluster configuration that could be slowing down segment load balancing, or if there is something else I can do to improve segment balancing across historicals. We are using Druid version 0.14.2.

Regards,
Vinay Patil

You want to set your maxSegmentsToMove higher. That’s a value set to avoid stampedes. The “correct” value depends on how large you keep your segments and how many you have. It throttles the rate at which balancing happens, so if you want to balance faster, try setting it to a higher number. You can set it back down afterwards, or keep it higher if you like.
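
For example (just a sketch — assuming the default coordinator port 8081 and 500 as an illustrative value), you can read the current dynamic config, edit maxSegmentsToMove in the returned JSON, and post the whole document back:

curl http://<COORDINATOR_IP>:8081/druid/coordinator/v1/config > dynamic-config.json
# edit dynamic-config.json so it contains e.g. "maxSegmentsToMove": 500, then:
curl -X POST -H 'Content-Type: application/json' -d @dynamic-config.json http://<COORDINATOR_IP>:8081/druid/coordinator/v1/config

Round-tripping the full document is the safer approach, since fields omitted from the POST body may fall back to their defaults.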


That helped! Thank you.

@vinay,

Though I specified maxSegmentsToMove as 500 and balancerComputeThreads as 4 in the coordinator dynamic configuration in the console, segment balancing is still slow (throttled at 5 segments at a time).
Can we specify this property in runtime.properties?
If yes, which of the options below do we need to add?

druid.coordinator.maxSegmentsToMove=500
OR
druid.maxSegmentsToMove=500
OR
maxSegmentsToMove=500

I have tried the first two options and restarted Druid, but segment balancing is still slow.

Thank you,
Vinay

Hey J - this config is in the metadata database - see https://druid.apache.org/docs/latest/configuration/index.html#dynamic-configuration

As per that doc, it’s worth doing a simple GET to check that your config has taken effect:
http://<COORDINATOR_IP>:<PORT>/druid/coordinator/v1/config
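
Something like this, assuming the default coordinator port 8081:

curl http://<COORDINATOR_IP>:8081/druid/coordinator/v1/config

The response is a JSON document; if the change took effect you should see your maxSegmentsToMove and balancerComputeThreads values reflected in it.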

There is also config that determines how often the coordinator runs its duties, including rebalancing - worth checking that too:
https://druid.apache.org/docs/latest/configuration/index.html#coordinator-operation
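
The main knob there is the duty cycle period. A sketch of the relevant coordinator runtime.properties (the values shown are just the documented defaults, for illustration):

# how often the coordinator runs its duties, including segment balancing
druid.coordinator.period=PT60S
# how long the coordinator waits after startup before its first run
druid.coordinator.startDelay=PT300S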

Hope this helps…?!?

Hi Jay,

I updated the dynamic coordinator config from the Coordinator UI to the values below, and that sped up the balancing of segments.

balancerComputeThreads - 10

maxSegmentsInNodeLoadingQueue - 2000

maxSegmentsToMove - 2000

replicationThrottleLimit - 1000

The rest of the fields in the configuration were not updated, i.e. I did not change their default values. The same settings as a JSON body for the dynamic config endpoint are sketched below.
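
As a sketch, the equivalent JSON body for the /druid/coordinator/v1/config endpoint mentioned earlier:

{
  "balancerComputeThreads": 10,
  "maxSegmentsInNodeLoadingQueue": 2000,
  "maxSegmentsToMove": 2000,
  "replicationThrottleLimit": 1000
}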

Regards,
Vinay Patil

Thank you Vinay.
I will try this combination.
Btw, how much time did it take to rebalance your data?

Previously I tried setting only maxSegmentsToMove to 100 in the coordinator dynamic configuration UI, which wasn’t honored …
Then I increased balancerComputeThreads to 10 and replicationThrottleLimit to 100. That didn’t work either…

Will update the result.

Thanks
Jay

Vinay,
I have tried all four dynamic configuration parameters; unfortunately, segment balancing didn’t pick up speed.

Then I tried changing the coordinator runtime parameters below, and the segment balancing rate improved slightly. When I bumped the values from 30 to 100, the balancing rate didn’t improve further.

druid.coordinator.loadqueuepeon.type=http (previously curator)

druid.coordinator.loadqueuepeon.http.batchSize=20 (from 1)

druid.segmentCache.numLoadingThreads=20
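
For reference, my assumption of where these settings normally live, based on the standard config layout — the load queue peon settings on the coordinator, and the loading threads on each historical:

# coordinator runtime.properties
druid.coordinator.loadqueuepeon.type=http
druid.coordinator.loadqueuepeon.http.batchSize=20

# historical runtime.properties
druid.segmentCache.numLoadingThreads=20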

In this cluster, the Overlord and Coordinator are running on the same node and port, though there is no ingestion happening currently that would impact the rebalancing rate.

Jay

Thank you Peter.
I did the GET in Postman and could see that the configuration change is getting applied, but there is no impact on historical rebalancing.
I have tried changing the runtime properties below, but there is still no impact.

druid.coordinator.loadqueuepeon.type=http (from curator)
druid.coordinator.loadqueuepeon.http.batchSize=10
druid.segmentCache.numLoadingThreads=20

I have 3 historical nodes of i3.4x with cluster replicants=2.
Thanks,
Jay

The above configuration worked for me. I remember running into an issue where a segment was stuck during load balancing, and that slowed down the process. Check the coordinator logs for any such error. I had to restart that specific historical node to get the load balancing done faster.

https://support.imply.io/hc/en-us/articles/360041771213-Balancer-move-segments-queue-has-a-segment-stuck
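
One way to confirm whether a segment is stuck (a suggestion assuming the standard coordinator API, not something from the article above) is to poll the coordinator’s load queue and watch whether a server’s queue never drains across coordinator runs:

curl "http://<COORDINATOR_IP>:8081/druid/coordinator/v1/loadqueue?simple"

If one historical’s queue stays full, restarting that historical is what cleared it for me.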