Coordinator takes 30+ minutes to show up in the UI, and segment availability after ingestion is painfully slow

Hi All,

I’ve started seeing an issue where it takes hours for segments to become available after ingestion. I restarted the Coordinator, and after that it took 30+ minutes to show up in the UI. I am using a ZK quorum of 3 nodes. I didn’t see any exceptions in the logs, and within seconds of the service restart there was a log entry for a successful connection to ZK.

I restarted the Query service and the Overlord just to see whether there was a ZK issue, but these services were available immediately after restart. My cluster has been running for more than a year now on the same machines.

Any suggestions on what else I can try? Or should I create a new cluster on new machines?
If I create a new cluster, can I just point it at the existing RDS to load all the previous data, and will it take care of loading the data? Or do I need to do something special?
It’s a production issue, so any quick help would be really appreciated.

Thanks in advance.

I’d start with the coordinator and historical logs, to see what is going on from a segment loading perspective.

In the logs, where is the coordinator waiting or taking a long time at startup?
How much data is being loaded? How many new segments are there?
How saturated is the active coordinator in terms of CPU/Memory?
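
One way to put numbers on the "how much is being loaded" question is to look at the coordinator's load queue. The sketch below assumes the Coordinator API endpoint /druid/coordinator/v1/loadqueue?simple and the field names segmentsToLoad and segmentsToDrop in its response; please verify both against the API reference for your Druid version. The coordinator URL is a placeholder.

```python
import json
from urllib.request import urlopen


def summarize_load_queue(queue):
    """Total pending loads/drops across Historicals, plus the busiest host.

    `queue` is assumed to map each Historical host to a dict containing
    "segmentsToLoad" and "segmentsToDrop" counts.
    """
    total_load = sum(h.get("segmentsToLoad", 0) for h in queue.values())
    total_drop = sum(h.get("segmentsToDrop", 0) for h in queue.values())
    # Host with the most segments still waiting to be loaded (None if empty).
    busiest = max(
        queue,
        key=lambda host: queue[host].get("segmentsToLoad", 0),
        default=None,
    )
    return {"to_load": total_load, "to_drop": total_drop, "busiest": busiest}


def fetch_load_queue(coordinator_url):
    # coordinator_url is a placeholder, e.g. "http://coordinator-host:8081".
    url = coordinator_url + "/druid/coordinator/v1/loadqueue?simple"
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

Calling summarize_load_queue(fetch_load_queue(...)) every minute or so after an ingestion finishes should show whether the queue is draining slowly or is simply not being filled.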

I checked the logs and no exceptions were observed. It’s a dedicated host with very low CPU and memory usage.

I wasn’t thinking of exceptions, but rather of reading through the coordinator’s initialization to see where it is taking a long time.
How many segments are created by the ingestion? How long do they take to become available after ingestion?
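
To find where the startup stalls without reading the whole log by eye, a small script can list the largest time gaps between consecutive log lines. This is a minimal sketch that assumes each log line starts with a timestamp like 2023-01-15T10:00:00,123 (a common log4j2 layout); adjust the pattern if your logging configuration differs. The function names are made up for illustration.

```python
import re
from datetime import datetime

# Assumed line prefix: 2023-01-15T10:00:00,123 (ISO date, comma, milliseconds).
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}),(\d{3})")


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    m = TS_RE.match(line)
    if m is None:
        return None
    base = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
    return base.replace(microsecond=int(m.group(2)) * 1000)


def largest_gaps(lines, top=5):
    """Return up to `top` (gap_seconds, line_before, line_after), largest first."""
    stamped = []
    for line in lines:
        ts = parse_ts(line)
        if ts is not None:  # skip continuation lines such as stack traces
            stamped.append((ts, line.rstrip("\n")))
    gaps = [
        ((t1 - t0).total_seconds(), l0, l1)
        for (t0, l0), (t1, l1) in zip(stamped, stamped[1:])
    ]
    gaps.sort(key=lambda g: g[0], reverse=True)
    return gaps[:top]
```

Feeding it the coordinator log from a slow restart (for example, largest_gaps(open("coordinator.log")), where the file name is a placeholder) shows which log line the coordinator sat on the longest before moving to the next step.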

Did anything else change? Was this a gradual slow down for segment availability time or a sudden change in behavior?

Another consideration might be to look at ZooKeeper utilization levels. By default, load/unload instructions from the coordinator to the historicals go through ZK. You can change this by setting druid.coordinator.loadqueuepeon.type=http, so that ZooKeeper is no longer the middleman between the coordinator and the historicals. Also look into the Apache Druid configuration reference for the settings that let you send larger batches of load/unload instructions with this mechanism.
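
As a rough illustration, the HTTP load queue settings might look like the fragment below. The batch size and thread count are example values only, not recommendations. As far as I recall from the configuration reference, the batch size must be smaller than druid.segmentCache.numLoadingThreads on the historicals, so please check both properties against the docs for your Druid version.

```properties
# Coordinator runtime.properties
# Send load/drop requests to the historicals over HTTP instead of ZK.
druid.coordinator.loadqueuepeon.type=http
# Number of load/drop requests batched into one HTTP call (example value).
druid.coordinator.loadqueuepeon.http.batchSize=4

# Historical runtime.properties
# Example value; chosen here only so that it is larger than the batch size above.
druid.segmentCache.numLoadingThreads=8
```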