Avoiding rebalance on historical restart?

Hey there,

Are there any suggestions on how best to avoid the Coordinator assigning segments to other Historicals during upgrades? With our current process, when performing a rolling upgrade we have to wait a few minutes after upgrading each Historical until all segment loading has settled down, so that HDFS isn’t overloaded.

I guess the most straightforward solution is to simply shut down all Coordinators for the duration of the upgrade, but the upgrade may take some time, and this would prevent real-time tasks from handing off.

The following docs suggest there’s some sort of lag time between a Historical going offline and the Coordinator reassigning its segments.

http://druid.io/docs/latest/design/coordinator.html#segment-availability

Would anyone be able to help with the following:

  • Does this lifetime also include datasources that aren’t replicated? Or will they immediately be reassigned regardless of the lifetime?

  • Where is the lifetime configured or what is it set to?

I’ve tried my best to look through the Coordinator’s code but haven’t had any luck. If anyone has pointers to specific classes to look at that might help with the above, that’d be very much appreciated!

Kind regards,

Dylan

Hi Dylan,

I’m not sure that section on “Segment Availability” is still accurate. I didn’t see any related configs, and I don’t recall there being such logic in the code at present. I’ll file an issue to check on that.

One thing you can do to control the rate of segment replication when a historical goes down is to set the replicationThrottleLimit in the coordinator dynamic config: http://druid.io/docs/latest/configuration/coordinator.html

This determines how many segment replicants can be created in a single coordinator run period, so if you set this to a low value relative to your total number of segments, this would reduce the churn during historical restarts.
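
As a rough sketch of how that could be set, the snippet below reads the current dynamic config from the Coordinator's /druid/coordinator/v1/config endpoint, changes only replicationThrottleLimit, and posts the result back. The host and port (localhost:8081), the helper names and the value 10 are placeholders for illustration, not recommendations. Reading the config first is a precaution, on the assumption that some Druid versions replace the whole dynamic config on POST rather than merging it.

```python
import json
import urllib.request

# Assumed Coordinator address; adjust for your cluster.
COORDINATOR_URL = "http://localhost:8081"
CONFIG_PATH = "/druid/coordinator/v1/config"


def fetch_current_config(base_url: str) -> dict:
    """Read the Coordinator's current dynamic config as a dict."""
    with urllib.request.urlopen(base_url.rstrip("/") + CONFIG_PATH) as resp:
        return json.load(resp)


def with_throttle_limit(current_config: dict, limit: int) -> dict:
    """Return a copy of the config with only replicationThrottleLimit changed."""
    if limit < 1:
        raise ValueError("replicationThrottleLimit must be at least 1")
    updated = dict(current_config)
    updated["replicationThrottleLimit"] = limit
    return updated


def build_update_request(base_url: str, config: dict) -> urllib.request.Request:
    """Build (but do not send) the POST that stores the dynamic config."""
    return urllib.request.Request(
        base_url.rstrip("/") + CONFIG_PATH,
        data=json.dumps(config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    current = fetch_current_config(COORDINATOR_URL)
    # 10 is an arbitrary example value; pick one that suits your segment count.
    request = build_update_request(COORDINATOR_URL, with_throttle_limit(current, 10))
    with urllib.request.urlopen(request) as resp:
        print(resp.status)
```

You could lower the limit before the rolling upgrade and post the original config back afterwards.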

Does this lifetime also include datasources that aren’t replicated? Or will they immediately be reassigned regardless of the lifetime?

If the datasource isn’t replicated, I believe segments would be reassigned immediately.

Thanks,

Jon