Coordinator emitting that segments are unavailable and under-replicated

Hi everyone,

I am not sure whether there is a problem, so let me describe what I am seeing.

Periodically, maybe twice a day, the “segment/unavailable/count” metric reports “4” for 2 consecutive minutes, and “segment/underReplicated/count” reports “8” for the same 2 minutes (the coordinator loop runs every minute). This happens right after our kafka-indexing-service tasks publish their segments. We have taskCount=2 and replicas=2 (on two middle manager machines), so there are 4 segments in total being published, which matches the unavailable and under-replicated metrics.

Most of the time these metrics are at 0, as they should be. Our indexing tasks each have a 1-hour duration, so most of the time the segments are handed off without any issue. As I understand it, the middle-manager peon process should wait until the coordinator confirms that a historical has picked up the segment before the peon shuts down.
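To make that expectation concrete, here is a minimal Python sketch of a wait-for-handoff loop. It is not Druid's actual code: `wait_for_handoff`, the `is_loaded` callback, and the polling and timeout values are hypothetical stand-ins for "ask the coordinator whether a historical is serving the segment".

```python
import time
from typing import Callable


def wait_for_handoff(
    is_loaded: Callable[[], bool],
    poll_seconds: float,
    timeout_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until is_loaded() returns True or the timeout expires.

    Returns True if handoff was confirmed, False on timeout.
    is_loaded is a hypothetical callback, not a Druid API.
    """
    deadline = clock() + timeout_seconds
    while True:
        if is_loaded():
            return True   # a historical is serving the segment; safe to stop
        if clock() >= deadline:
            return False  # gave up waiting; the caller decides what to do
        sleep(poll_seconds)
```

If the peon behaves like this, it keeps serving the segment until the check succeeds, so there should be no gap in which nobody serves the data.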

We are on Druid 0.12.3

Does anyone have ideas for what to check further to figure out why this might be happening? Or is it expected sometimes? I really want to make sure we avoid returning incorrect query results during those 2 minutes that the new segments don’t appear to be loaded.

Thank you,

Michael

Hi Michael,

With Kafka indexing, your queries are served directly from the middle manager peons until the segments have been successfully published and loaded by Historicals. So there is no chance of getting incorrect results if the segments have not yet been loaded by Historicals at the time you run the query.

Thanks,

Vaibhav

Right, that’s what I thought. I am just surprised about the metric of 4 unavailable and 8 underreplicated segments - I wonder if something is not working right and those new segments aren’t always available on either the middlemanager or historical node?

Hi Michael,

In my understanding, it does not indicate an issue. It just means that segment replication has not finished yet.

Essentially, the Druid Coordinator process is primarily responsible for segment management and distribution. More specifically, the Druid Coordinator process communicates to Historical processes to load or drop segments based on configurations. The Druid Coordinator is responsible for loading new segments, dropping outdated segments, managing segment replication, and balancing segment load.

To understand more about what exactly is happening, could you attach the following:

  1. The coordinator log for the period where you see the under-replicated messages.

  2. Coordinator configurations

  3. Load rule set for the data source under consideration.
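If it helps, items 2 and 3 can partly be pulled from the Coordinator's HTTP API. The sketch below is a rough Python helper, assuming the Coordinator is reachable at a base URL such as `http://localhost:8081` (an assumption; use your own host and port). The helper names are made up for illustration, and the endpoint paths should be checked against the API reference for your Druid version.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def coordinator_urls(base_url: str, datasource: str) -> dict:
    """Build the Coordinator URLs for load rules and load status."""
    base = base_url.rstrip("/")
    ds = quote(datasource, safe="")  # escape the datasource name for the path
    return {
        "rules": f"{base}/druid/coordinator/v1/rules/{ds}",
        "loadstatus_simple": f"{base}/druid/coordinator/v1/loadstatus?simple",
        "loadstatus_full": f"{base}/druid/coordinator/v1/loadstatus?full",
    }


def fetch_json(url: str, timeout: float = 10.0):
    """GET a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def collect(base_url: str, datasource: str) -> dict:
    """Fetch every URL from coordinator_urls() into one dict."""
    urls = coordinator_urls(base_url, datasource)
    return {name: fetch_json(url) for name, url in urls.items()}


# Example (needs a running Coordinator; the datasource name is a placeholder):
# print(json.dumps(collect("http://localhost:8081", "my_datasource"), indent=2))
```

Running `collect` during one of the 2-minute windows and again afterwards would show whether the counts drop to zero once the Historicals finish loading.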

Thanks,

Vaibhav

Hi Michael,

Your understanding is correct!

The coordinator currently knows nothing about real-time tasks; it just sees the segments that appear in the metadata store. Those metrics effectively record the segments the coordinator is watching for, both “new” and “missing”.
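As a rough illustration, here is a simplified Python model of that view. It is not Druid's implementation: the class and function names are invented, and it assumes the load rule asks for 2 replicas per segment, which is not stated in this thread. The coordinator counts only copies loaded on Historicals, while a query can still be answered by a peon.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class SegmentState:
    segment_id: str
    required_replicas: int  # from the load rule (assumed to be 2 here)
    loaded_replicas: int    # copies currently loaded on Historicals
    served_by_peon: bool    # still served by the indexing task


def coordinator_counts(segments: Iterable[SegmentState]) -> Tuple[int, int]:
    """Return (unavailable, under_replicated) as the coordinator would see them.

    The coordinator ignores served_by_peon; it only knows about Historicals.
    """
    segs = list(segments)
    unavailable = sum(1 for s in segs if s.loaded_replicas == 0)
    under_replicated = sum(
        max(s.required_replicas - s.loaded_replicas, 0) for s in segs
    )
    return unavailable, under_replicated


def queryable(segment: SegmentState) -> bool:
    """A segment can be queried if a Historical or a peon serves it."""
    return segment.loaded_replicas > 0 or segment.served_by_peon


# Four freshly published segments, none loaded yet, peons still serving them.
just_published = [SegmentState(f"seg-{i}", 2, 0, True) for i in range(4)]
print(coordinator_counts(just_published))         # (4, 8)
print(all(queryable(s) for s in just_published))  # True
```

Under these assumptions, four newly published segments give 4 unavailable and 8 under-replicated, the same numbers Michael reported, while every segment remains queryable through the peons.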

The Druid Coordinator process is primarily responsible for segment management and distribution. More specifically, the Druid Coordinator process communicates to Historical processes to load or drop segments based on configurations. The Druid Coordinator is responsible for loading new segments, dropping outdated segments, managing segment replication, and balancing segment load.

In my opinion, you should not be worried about these messages in the coordinator log.

Hope this helps.

Thanks and Regards,

Vaibhav