[9.1.1] Ingested segment not being picked up (production outage!)

Looking at a situation in our cluster right now where a segment we ingested (via batch ingestion) at around 4AM UTC is still the one loaded and being served, despite 10+ apparently successful ingestions for the same segment since then. In other words, all of those ingestions look successful in the overlord console, but in the coordinator console the version for that segment is still the 4AM one.

I’m looking in the coordinator logs but not really sure what to look for. I see log lines whose message starts with “io.druid.server.coordinator.DruidCoordinator - Removing Segment”, which look to me like they happen when a newly ingested segment replaces a previous version. Those log lines stopped appearing sometime after 5AM UTC, which seems pretty suspicious. But I don’t know what else to look for.

Any help with what else to look for or any ways to fix would be really appreciated. This is currently causing a partial production outage for us.

Try tracing through the segment loading path to see where it breaks down, by inspecting deep storage, the metadata store, and the logs (a rough spot-check sketch follows the list):

  • segment files pushed to deep storage

  • segment record written to metadata table

  • coordinator polls record from metadata table

  • coordinator identifies a historical node that should load the segment, based on load rules, tier, capacity, etc

  • coordinator instructs historical node to load the segment

  • historical node loads the segment

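To spot-check the coordinator-facing steps, you can ask the coordinator which segment versions it is actually serving for the datasource and which segment ids it has polled from the metadata store, and compare the two. This is only a sketch: the host, port, and datasource name are placeholders, and the two coordinator HTTP endpoints used here should be confirmed against the coordinator API docs for your Druid version.

```python
# Rough spot-check (placeholders throughout): compare the segment versions the
# coordinator is serving against what it sees in the metadata store.
import json
from urllib.request import urlopen

COORDINATOR = "http://coordinator-host:8081"   # placeholder coordinator address
DATASOURCE = "my_datasource"                   # placeholder datasource name

def get_json(path):
    with urlopen(COORDINATOR + path) as resp:
        return json.loads(resp.read().decode())

# Full segment descriptors the coordinator is currently serving for the datasource
served = get_json("/druid/coordinator/v1/datasources/%s/segments?full" % DATASOURCE)

# Segment ids the coordinator has polled from the metadata store (used = true)
in_metadata = get_json("/druid/coordinator/v1/metadata/datasources/%s/segments" % DATASOURCE)

print("versions being served:", sorted({s["version"] for s in served}))
print("segment ids in metadata store:", len(in_metadata))
```
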
OK, I had gotten as far as confirming that the segment files are pushed to deep storage, and they are.

I’m not sure where to look for the other things… the coordinator logs? Directly in the database for the metadata table?

Yeah, for “segment record written to metadata table” check the metadata table and confirm the record exists with used = true. For the things downstream of that, the coordinator and historical logs will have the info.
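
A minimal sketch of that check, assuming a MySQL metadata store with the default druid_segments table and the pymysql client (host, credentials, datasource, and interval start below are placeholders; use the equivalent client if Postgres or Derby backs your cluster):

```python
# Minimal sketch: list the segment versions recorded for one datasource/interval
# and whether each is marked used. All connection details are placeholders.
import pymysql

conn = pymysql.connect(host="metadata-db-host", user="druid",
                       password="druid-password", database="druid")
try:
    with conn.cursor() as cur:
        # druid_segments is the default segments table; `used` marks the
        # versions the coordinator should be loading onto historicals.
        cur.execute(
            "SELECT id, version, used, created_date "
            "FROM druid_segments "
            "WHERE dataSource = %s AND start = %s "
            "ORDER BY created_date",
            ("my_datasource", "2016-01-01T00:00:00.000Z"))  # placeholder interval start
        for seg_id, version, used, created in cur.fetchall():
            print(used, version, created, seg_id)
finally:
    conn.close()
```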

I do see all the ingested segments showing up in the druid_segments table in the metadata db. Is there anything else to check there?

I notice that the “used” flag is true for multiple entries there. All the segments from the “stuck” 4AM one onward have used=true, while the segments ingested before the 4AM one have used=false.

OK, still fumbling somewhat blindly in the logs, but I see the coordinator logs have a line like “io.druid.metadata.SQLMetadataSegmentManager - Polled and found X segments in the database” that had been occurring regularly, about every 5 minutes. It appeared a few more times during the 5AM UTC hour and then stopped, and has not reappeared since.

Log lines “SQLMetadataRuleManager - Polled and found rules”, “DruidCoordinatorSegmentInfoLoader - Starting coordination. Getting available segments.”, and “io.druid.server.coordinator.helper.DruidCoordinatorSegmentInfoLoader - Found available segments.” are still occurring.
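
For reference, this is roughly the scan I’m using to find the last time that poll message appeared; the log path and timestamp pattern are placeholders for whatever your logging setup produces:

```python
# Quick scan for the last occurrence of the metadata poll message in a
# coordinator log. Log path and timestamp regex are placeholders.
import re

POLL_MARKER = "SQLMetadataSegmentManager - Polled and found"
TS_RE = re.compile(r"^\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}")

last_seen = None
with open("/var/log/druid/coordinator.log") as log:   # placeholder path
    for line in log:
        if POLL_MARKER in line:
            match = TS_RE.match(line)
            last_seen = match.group(0) if match else line.strip()

print("last metadata poll message seen at:", last_seen)
```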

I appreciate the help… would it make sense to take this to the IRC channel maybe? Although I’m going to be offline for the next hour, starting in a minute.

Maybe this is a clue or maybe it’s meaningless… but I found that while the leader coordinator stopped emitting the “io.druid.metadata.SQLMetadataSegmentManager - Polled and found X segments in the database” log message, one of the non-leader coordinator instances is still emitting it. In fact, it had been emitting it at a relatively low rate (10 or 11 times per hour) for the past couple of days; then after 5AM UTC, when the leader stopped emitting it, the non-leader suddenly started emitting it 60 times per hour. No sign in the logs that this non-leader is actually telling historicals to add or drop anything, though. (The leader is still telling historicals to add and drop segments, but as far as I can tell they are all moves rather than new segments.)
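
In case it’s useful, this is roughly how I’m checking which instance each coordinator believes is the leader. Hostnames are placeholders, and the /druid/coordinator/v1/leader endpoint should be verified against the coordinator API docs for your Druid version:

```python
# Ask each coordinator instance who it thinks the current leader is.
from urllib.request import urlopen

COORDINATORS = [
    "http://coordinator-1:8081",   # placeholder
    "http://coordinator-2:8081",   # placeholder
]

for coord in COORDINATORS:
    try:
        with urlopen(coord + "/druid/coordinator/v1/leader") as resp:
            print(coord, "reports leader:", resp.read().decode().strip())
    except Exception as exc:
        print(coord, "unreachable:", exc)
```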

I’m going to try restarting the coordinator(s) and see if that helps.

Restarting the coordinators fixed it. That gets us out of our outage, hoorah. It seems like there might be a pretty bad bug lurking, however. If you have any other ideas about what to look for in the logs, I’d be happy to follow up.

One other thing I noticed: after restarting the coordinators, I went to the coordinator console and it was showing a number of recent dates with more than one “shard” in the data sources view. Our cluster is not sharded. When I clicked on today’s segment, it showed a list of 20 different segment versions, which appeared to correspond to all the segments that had been ingested since that stuck 4AM one (inclusive), i.e., all the segments with “used = true” in the druid_segments table. After waiting a few minutes, that disappeared and all the segment counts went back down to 1. Looking at druid_segments now, there is only one segment for the date with “used = true”.