From time to time we have seen one or two historical nodes in our cluster perform poorly for a few minutes and then recover. This is to be expected given AWS issues, bad luck, and so on. However, we were surprised by how much impact a single slow node had on overall cluster performance.
Since most of our queries have high fanout, there is a high probability that every query will hit a slow node for at least one segment. On the other hand, we also run with a high replication factor, so it should (in theory) be possible to satisfy all queries by temporarily routing around slow nodes. It doesn't look like there is any code in the brokers to deal with poorly performing nodes; a node is either in or out depending on whether it is still talking to ZooKeeper.
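To make the idea concrete, here is a minimal sketch of what slow-node-aware replica selection in a broker might look like. This is purely illustrative and not Druid's actual API; the names (`pick_replica`, `recent_p99_ms`, `SLOW_THRESHOLD_MS`) and the p99-latency heuristic are assumptions for the sake of the example.

```python
import random

# Assumed cutoff (ms) above which a node is considered "slow" -- illustrative only.
SLOW_THRESHOLD_MS = 500


def pick_replica(replicas, recent_p99_ms):
    """Choose a server holding a segment, preferring nodes whose recent
    p99 latency is under the threshold; fall back to any replica if
    every holder is currently slow (hypothetical broker logic)."""
    healthy = [s for s in replicas
               if recent_p99_ms.get(s, 0) < SLOW_THRESHOLD_MS]
    return random.choice(healthy or replicas)


# Example: a segment is replicated on three nodes and node-b is slow.
latency = {"node-a": 40, "node-b": 2500, "node-c": 55}
chosen = pick_replica(["node-a", "node-b", "node-c"], latency)
# chosen will be node-a or node-c, never the slow node-b
```

With high replication, a rule like this could let most queries sidestep a transiently slow node entirely, while the fallback keeps queries answerable even when every replica holder is degraded.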
We are wondering if any work has been done in this area. It seems like there is an opportunity to optimize this behavior for larger clusters.
Thanks for any thoughts.