Replicant create queue still has 10 segments and gets stuck after 15+ runs

Hey, I added a third historical node and kept the replication factor at 2. Rebalancing finished successfully, but I’m still getting these errors:

2017-11-01T15:01:42,552 ERROR [Coordinator-Exec–0] io.druid.server.coordinator.ReplicationThrottler - [_default_tier]: Replicant create queue stuck after 15+ runs!: {class=io.druid.server.coordinator.ReplicationThrottler, segments=[gwiq-p_2017-10-28T14:00:00.000Z_2017-10-28T15:00:00.000Z_2017-10-28T17:36:32.581Z ON 172.31.7.89:8083, gwiq-p_2017-10-28T09:00:00.000Z_2017-10-28T10:00:00.000Z_2017-10-28T12:33:44.415Z ON 172.31.7.89:8083, gwiq-p_2017-10-28T12:00:00.000Z_2017-10-28T13:00:00.000Z_2017-10-28T15:35:45.717Z ON 172.31.7.89:8083, gwiq-p_2017-10-28T06:00:00.000Z_2017-10-28T07:00:00.000Z_2017-10-28T09:34:44.547Z ON 172.31.7.89:8083, gwiq-p_2017-10-28T05:00:00.000Z_2017-10-28T06:00:00.000Z_2017-10-28T08:30:09.589Z ON 172.31.7.89:8083, gwiq-p_2017-10-28T07:00:00.000Z_2017-10-28T08:00:00.000Z_2017-10-28T10:34:23.012Z ON 172.31.7.89:8083, gwiq-p_2017-10-28T02:00:00.000Z_2017-10-28T03:00:00.000Z_2017-10-28T05:27:34.225Z ON 172.31.7.89:8083, gwiq-p_2017-10-28T04:00:00.000Z_2017-10-28T05:00:00.000Z_2017-10-28T07:28:56.744Z ON 172.31.7.89:8083, gwiq-p_2017-10-28T13:00:00.000Z_2017-10-28T14:00:00.000Z_2017-10-28T16:36:35.438Z ON 172.31.7.89:8083, gwiq-p_2017-10-28T01:00:00.000Z_2017-10-28T02:00:00.000Z_2017-10-28T04:28:19.409Z ON 172.31.7.89:8083]}
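A quick way to see where the stuck replicants are headed is to split the `segments=[...]` part of that message into segment/server pairs and count them per server. The following is only a minimal sketch: the helper names `stuck_segments` and `count_by_server` are made up for illustration, and the regular expressions assume the exact `<segment id> ON <host:port>` layout shown in the log line above.

```python
import re
from collections import Counter

# One "<segment id> ON <host:port>" entry, as printed by ReplicationThrottler.
ENTRY = re.compile(r"(\S+) ON (\S+?:\d+)")


def stuck_segments(log_line):
    """Return (segment_id, server) pairs found in the 'segments=[...]' part."""
    match = re.search(r"segments=\[(.*?)\]", log_line)
    if match is None:
        return []
    return ENTRY.findall(match.group(1))


def count_by_server(log_line):
    """Count how many stuck segments are queued for each target server."""
    return Counter(server for _, server in stuck_segments(log_line))


if __name__ == "__main__":
    # Shortened copy of the error above, keeping only the first two entries.
    sample = (
        "Replicant create queue stuck after 15+ runs!: "
        "{class=io.druid.server.coordinator.ReplicationThrottler, segments=["
        "gwiq-p_2017-10-28T14:00:00.000Z_2017-10-28T15:00:00.000Z_2017-10-28T17:36:32.581Z"
        " ON 172.31.7.89:8083, "
        "gwiq-p_2017-10-28T09:00:00.000Z_2017-10-28T10:00:00.000Z_2017-10-28T12:33:44.415Z"
        " ON 172.31.7.89:8083]}"
    )
    print(count_by_server(sample))
```

Run against the full line above, all ten entries point at 172.31.7.89:8083.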

Segments seem to be dropped and assigned over and over, as if in a livelock. We have 25,000 segments, which is too many; I’m currently trying to reduce the number with druid.coordinator.merge.on.

Until then, though, I need to solve this. Is there a coordinator configuration that would prevent this livelock caused by too many segments?

Which version are you on? Later versions include a lot of improvements for handling larger numbers of segments, so if you aren’t on the latest, I would try upgrading to see if that helps.

We use 0.10.1, but this is the first time we have had more historicals than the replication factor, so the problem is probably there.