Coordinator not running cycle every minute as configured

Hi,

We are having issues with our Coordinator master not running a coordination cycle on the configured 60s interval. We have seen gaps between runs ranging from a few minutes to 15 minutes or more.

I have tried a few things, but am still seeing the problem:

  1. I reduced the dynamic config for the number of segments to move from 50 to 5.
  2. Since the coordinator is co-located with a historical node (we have beefy hardware), I made sure that the number of processing threads for the historical is well below what the server has available, to confirm the coordinator wasn’t waiting for CPU.
  3. I confirmed that there aren’t excessive GC pauses on the coordinator.

I haven’t had time to start digging through the actual code to see what the coordinator lifecycle looks like, but it feels to me like something long-running is preventing the DruidCoordinatorSegmentInfoLoader from running.

I know that I haven’t been able to provide many details. I am still digging into the problem and wanted to start a conversation here in case anyone has seen the same issue and can point me toward a solution.

Thanks,

Lucas

Hey Lucas,

Some things that come to mind: an excessively large number of segments (many hundreds of thousands, or millions?), networking issues (trouble talking to ZK or the metadata storage), or an overloaded (or highly contended) MySQL/PostgreSQL that is taking a very long time to respond to queries.

Do you see logging activity happening during the whole 15 minutes between runs? What is the coordinator doing?
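If it helps, here is a rough sketch for spotting quiet periods in the coordinator log. It assumes each log line starts with an ISO-8601 timestamp such as 2016-03-01T12:00:00,123 (adjust the regex if your log4j pattern differs). The function name and the 2-minute threshold are just illustrative choices.

```python
import re
from datetime import datetime, timedelta

# Assumption: log lines begin with a timestamp like 2016-03-01T12:00:00,123.
# Lines without one (e.g. stack-trace continuation lines) are skipped.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")


def find_log_gaps(lines, threshold=timedelta(minutes=2)):
    """Return (previous, next) timestamp pairs that are more than `threshold` apart."""
    gaps = []
    prev = None
    for line in lines:
        match = TS_RE.match(line)
        if not match:
            continue
        ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        if prev is not None and ts - prev > threshold:
            gaps.append((prev, ts))
        prev = ts
    return gaps
```

You could call it as `find_log_gaps(open("coordinator.log"))` (the file name is a placeholder) and then look at the last few lines logged before each gap to see what the coordinator was doing when it went quiet.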

Especially if you see long pauses where there’s not much logging happening, it would be helpful to run jstack to get a few thread dumps of the coordinator. Look for the coordination thread and see where it’s spending its time.
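As a rough sketch of that, the script below runs the JDK's `jstack <pid>` a few times and prints only the threads whose name contains a given substring. The default substring "Coordinator" is a guess at how the coordination thread is named, so check a full dump first. The helper names, sample count, and interval are arbitrary. It also assumes `jstack` is on the PATH and is run as the same user as the coordinator process.

```python
import subprocess
import time


def threads_matching(dump, name_part):
    """Return the stack blocks in jstack output whose thread name contains name_part."""
    blocks = []
    current = None
    for line in dump.splitlines():
        if line.startswith('"'):
            # Thread header line looks like: "thread name" #12 daemon prio=5 ...
            name = line.split('"')[1]
            current = [line] if name_part.lower() in name.lower() else None
            if current is not None:
                blocks.append(current)
        elif not line.strip():
            current = None  # a blank line ends the current thread's block
        elif current is not None:
            current.append(line)
    return ["\n".join(block) for block in blocks]


def sample_dumps(pid, name_part="Coordinator", count=3, interval_s=10):
    """Take `count` thread dumps `interval_s` seconds apart and print matching threads."""
    for i in range(count):
        result = subprocess.run(
            ["jstack", str(pid)], capture_output=True, text=True, check=True
        )
        print(f"--- dump {i + 1} ---")
        for block in threads_matching(result.stdout, name_part):
            print(block + "\n")
        if i < count - 1:
            time.sleep(interval_s)
```

If the same frames show up at the top of the stack in every dump (for example a socket read against the metadata store or ZK), that is a good hint at what the cycle is blocked on.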