After bringing a historical node back online after a short (less than 30m) outage, the coordinator logs show a spike in log events like
Dropping segment [<segment>] on server [<server>] in tier [<tier>].
- I shut down historical-0 and apply OS updates. This takes less than 30m.
- I start historical-0 back up.
- A significant spike in the number of aforementioned log events.
It appears that this may (I can’t tell for sure) block “normal” load/drop queues for datasources. At the very least it slows them down.
The Segment availability section states that the coordinator will treat segments as dropped on a historical node, “[g]iven a sufficient period of time”.
So, here’s my many questions:
- Why is this happening?
- Should I do something to avoid it, or is this normal?
- If I should avoid it, what should I do?
- Should I be doing something to tell the coordinator ahead of time that the historical is only going down temporarily?
- Is the “sufficient period of time” a configurable value?
Things I've tried
Relates to Apache Druid v0.20.2
- What I think you are experiencing is Druid trying to rebalance the cluster because it sees the Historical node as being “down” and is trying to make sure that the replication remains correct.
- This is normal, expected behavior
- I haven’t found anything in the docs about this, but will ask around.
I was looking at the docs around rolling upgrades, because that is a similar action to see if there was anything in there that would allow the cluster to be told, in essence, to chill during a rolling upgrade and not try to rebalance itself.
Thank you for the quick response, @Rachel_Pedreschi! Let me know if you learn anything else from asking around.
I found a related thread here: Coordinator re-assigns segments when historical goes down.
In that thread @Gian_Merlino suggests setting
maxSegmentsInNodeLoadingQueue to 100. This Imply blog post makes a similar suggestion.
maxSegmentsInNodeLoadingQueue The default value to this is 0 which is unbounded. It is always a good idea to cap this at a certain number so segment publishing is controlled at a rate the network can support. You can start with 100 and increase from there.
I’m going to give that a try next time I take a historical down. I’ll report back afterwards.