Coordinator re-assigns segments when historical goes down

Hi all,

We are running a Druid cluster with three historicals and a replication factor of 1.

When one historical goes down, we noticed that the following happens:

0 - historical1 goes down (obviously)

1 - after ~8 minutes, historical2 and historical3 start loading vast amounts of data from deep storage (presumably because the coordinator re-assigns the segments from the machine that went down)

2 - historical1 comes back up and starts loading vast amounts of data from deep storage (presumably because the coordinator re-assigns segments to this machine, but these segments are now (in part) different segments than the ones historical1 was a replica for previously)

The logs on the coordinator and historicals tell us nothing about this process except for the following line on the coordinator:

No available [_default_tier] servers or node capacity to assign primary segment[segment_name]! Expected Replicants[1]

It logs this line for a lot of segments.

This behaviour seems to contradict the Druid docs for multiple reasons:

  1. http://druid.io/docs/latest/design/coordinator.html#segment-availability suggests that there is some lag between the moment a historical goes down and the coordinator re-assigning its segments to another historical. However, I cannot find any config parameter that allows tuning this.

TL;DR:

  1. How can I configure the time interval that the coordinator waits before rebalancing cluster state when a historical goes down?

  2. Will increasing the replication factor to > 1 ensure that segments are not immediately reloaded from deep storage when a segment temporarily has fewer replicas than configured?


Hey Kees,

One thing I have seen be effective here is to set “maxSegmentsInNodeLoadingQueue” to some reasonable value, like a few hundred, which will prevent any historical’s load queue from getting too massive. It helps minimize churn in situations like this.
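For reference, here's a rough sketch of how you could set that via the coordinator's dynamic config API; the coordinator address and the value of 300 are just placeholders, adjust them for your cluster:

```python
# Sketch: cap each historical's load queue via the coordinator dynamic config.
# The coordinator URL below is an assumption -- point it at your own coordinator.
import requests

COORDINATOR = "http://coordinator-host:8081"  # placeholder address
CONFIG_URL = f"{COORDINATOR}/druid/coordinator/v1/config"

# Fetch the current dynamic config so only this one field is changed.
config = requests.get(CONFIG_URL).json()

# Limit how many segments may sit in any single historical's load queue.
config["maxSegmentsInNodeLoadingQueue"] = 300

resp = requests.post(CONFIG_URL, json=config)
resp.raise_for_status()
print("Coordinator dynamic config updated:", resp.status_code)
```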

Increasing the replication factor would also help, because there is a throttle on second replications, but not on first (Druid prioritizes getting data available ASAP). But setting that maxSegmentsInNodeLoadingQueue property should effectively act as a throttle on first replications too.
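And if you do go the replication route, a sketch of bumping the replica count with a load rule might look like the following; the datasource name and coordinator address are placeholders, not something from your setup:

```python
# Sketch: ask for two replicas in the default tier via a loadForever rule.
import requests

COORDINATOR = "http://coordinator-host:8081"  # placeholder address
DATASOURCE = "my_datasource"                  # placeholder datasource name

# A single loadForever rule requesting two replicas in _default_tier.
rules = [
    {
        "type": "loadForever",
        "tieredReplicants": {"_default_tier": 2},
    }
]

resp = requests.post(
    f"{COORDINATOR}/druid/coordinator/v1/rules/{DATASOURCE}",
    json=rules,
)
resp.raise_for_status()
print("Load rules updated:", resp.status_code)
```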

Hi Gian,

Thank you for your answer. We will include this parameter in our config.

However, I am still curious about what the docs mean by:

"Given a sufficient period of time, the segments may be reassigned to other Historical processes in the cluster. " (ref: http://druid.io/docs/latest/design/coordinator.html#segment-availability)

More specifically: Is this “period of time” tunable? If not, is it possible to find out what this period is?

Kees
