Adverse effects from a single slow historical node

Hi all

From time to time we have seen one or two historical nodes in our cluster perform poorly for a few minutes and then recover. This is to be expected due to AWS issues, bad luck, whatever. However, we were surprised at how much impact a single slow node had on overall cluster performance.

Since most of our queries have high fanout, there is a high probability of every query hitting a slow node for at least one segment (rough numbers below). On the other hand, we also have a high replication factor, so it should (in theory) be possible to satisfy all queries by temporarily ignoring slow nodes. It doesn't look like there is any code in the brokers to deal with poorly performing nodes; they seem to be either in or out depending on whether they are still talking to ZooKeeper.
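To put rough numbers on it (these are illustrative figures, not our actual segment counts): with H historicals and a query that fans out to S segments spread roughly evenly, the chance of the query touching any one particular node is about

    P(query touches a given node) ≈ 1 - (1 - 1/H)^S
    e.g. H = 50 historicals, S = 200 segments per query:
    1 - (49/50)^200 ≈ 1 - 0.02 ≈ 0.98

so with fanout like that, essentially every query ends up waiting on the slow node at least once.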

We are wondering if any work has been done in this area. It seems like there is an opportunity to optimize this behavior for larger clusters.

Thanks for any thoughts.

Hey Max,

First, try setting druid.broker.balancer.type=connectionCount on the broker. This helps minimize the damage done by slow nodes: they naturally accumulate a high connection count, so the broker will route around them.
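For reference, a minimal sketch of what that looks like in the broker's runtime.properties (only the balancer line is the actual suggestion; the comments are just my reading of the behavior):

    # Broker runtime.properties
    # Pick the historical with the fewest open connections among the replicas
    # holding a segment, instead of the default random selection.
    druid.broker.balancer.type=connectionCount

With a high replication factor this tends to shift traffic away from a node that is backed up, since its queued queries keep its connection count elevated.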

Charles also recently raised a suggestion to implement gray listing: https://github.com/druid-io/druid/issues/3449