We’re experiencing an issue with our Druid cluster that we’re having trouble resolving.
We have a fairly large set of tasks (roughly 30) that index with Tranquility at a six-hour segment granularity. At each granularity boundary (00:00, 06:00, 12:00, etc.) we see a brief blip in which a large number of our queries time out because of increased latency during the roll-over. There are no errors or warnings in any of the logs at these times, and it does not appear to be GC-related. The impact lasts about a minute.
We have tried tweaking the warming period and window period, but this does not seem to have much effect. We have ended up with code that adds some randomness to the warming and window periods, so that task start-up and handoff times are staggered and the load on the indexing service is spread over a longer period. This seems to keep CPU usage down on the middle manager during the switch, but it does not mitigate the blip, which still lasts about a minute and starts exactly at the granularity boundary.
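To illustrate the staggering, here is a minimal Python sketch of the idea. It is an illustration only, not the production code: the function names, the 10-minute base values, the 5-minute jitter range, and the dictionary keys are all made up for the example. Each task gets a random offset added to its warming and window periods, seeded by a hypothetical task index so the offset is stable across restarts.

```python
import random
from datetime import timedelta


def jittered_period(base: timedelta, max_jitter: timedelta, rng: random.Random) -> timedelta:
    """Return base plus a uniformly random offset in [0, max_jitter]."""
    offset_seconds = rng.uniform(0, max_jitter.total_seconds())
    return base + timedelta(seconds=offset_seconds)


def to_iso_period(period: timedelta) -> str:
    """Format a timedelta as a whole-second ISO 8601 duration, e.g. PT600S."""
    return f"PT{int(period.total_seconds())}S"


def periods_for_task(task_index: int) -> dict:
    """Compute staggered warming and window periods for one task."""
    # Seeding with the task index gives each task a different but repeatable offset.
    rng = random.Random(task_index)
    # Base values and jitter ranges are placeholders, not tuned recommendations.
    warming = jittered_period(timedelta(minutes=10), timedelta(minutes=5), rng)
    window = jittered_period(timedelta(minutes=10), timedelta(minutes=5), rng)
    return {
        "warmingPeriod": to_iso_period(warming),
        "windowPeriod": to_iso_period(window),
    }


if __name__ == "__main__":
    # Print the staggered periods for a handful of tasks.
    for i in range(5):
        print(i, periods_for_task(i))
```

With this kind of jitter, task creation and handoff are spread over a window of a few minutes rather than all happening at once, which matches the reduced CPU usage we see on the middle manager.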
We’d like to solve this problem, as it translates into several hundred timeouts for end-user requests at each of these times, each day.
Has anyone else experienced this, or does anyone have troubleshooting ideas?