We are having issues with our Coordinator master not running its coordination cycle on the configured 60s interval. We have seen gaps between runs ranging from a few minutes to 15 minutes or more.
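In case it helps anyone compare notes, here is a rough sketch of how the gaps between runs could be measured from the coordinator log. It assumes each log line starts with an ISO-8601 timestamp, and that some line is logged once per coordination run. The marker string below is a hypothetical placeholder, not an actual Druid log message, and the function names are made up for illustration.

```python
from datetime import datetime

# Hypothetical marker: substitute whatever line your coordinator logs once per run.
RUN_MARKER = "Starting coordination"


def run_gaps(lines, marker=RUN_MARKER):
    """Return the seconds between consecutive log lines containing `marker`.

    Assumes each line starts with a timestamp like 2016-03-01T12:00:00,123
    (a comma or a dot before the milliseconds).
    """
    stamps = []
    for line in lines:
        if marker not in line:
            continue
        raw = line.split()[0].replace(",", ".")
        stamps.append(datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%f"))
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


def slow_gaps(gaps, expected=60.0, tolerance=2.0):
    """Keep only the gaps longer than `tolerance` times the expected period."""
    return [g for g in gaps if g > expected * tolerance]
```

Passing an open log file to `run_gaps` and the result to `slow_gaps` would list only the intervals that are well beyond the 60s period, which makes it easier to line them up against whatever else the coordinator was doing at the time.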
I have tried a few things, but am still seeing the problem:
- I reduced the maximum number of segments to move (the maxSegmentsToMove dynamic config) from 50 to 5.
- Since the coordinator is co-located with a historical node (we have beefy hardware), I made sure that the historical’s processing thread count is well below the number of cores on the server, to rule out the coordinator waiting for CPU.
- I confirmed that there aren’t excessive GC pauses on the coordinator.
I haven’t had time to dig through the actual code to see what the coordinator lifecycle looks like, but this feels to me like something long-running is preventing the DruidCoordinatorSegmentInfoLoader from running.
I know that I haven’t been able to provide many details. I am only just starting to dig into the problem and wanted to start a conversation here in case anyone has seen the same issue and could put me on the right trail to a solution.