Diagnosing unavailable segments

We had some odd situations in our cluster that we thought we successfully repaired last week, but we’re left in a place where segment/unavailable/count doesn’t go below 11. We can see from the front page of the coordinator console (as well as the metrics) which data sources are contributing to this, but we’re not sure where to go from here.

Is there a way to figure out what segments are unavailable?

Is there a way to differentiate between “coordinator hasn’t come up with a good historical to ask to load them” and “coordinator has assigned them to a historical but the historical isn’t getting them”?

–dave

The /loadqueue API shows nothing queued, so that implies we’re in the “coordinator hasn’t come up with a good historical to ask to load them” case, right?
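(For reference, here’s roughly how we checked — a minimal sketch, assuming the coordinator is at COORDINATOR:8081 as in the URL below; the loadqueue endpoint’s “simple” view is just the per-server counts.)

# Sketch: ask the coordinator for each historical's segment load queue.
# Assumes the coordinator is reachable at COORDINATOR:8081, as elsewhere in this thread.
import requests

resp = requests.get("http://COORDINATOR:8081/druid/coordinator/v1/loadqueue?simple")
resp.raise_for_status()
for server, queue in resp.json().items():
    # the "simple" view returns per-server counts such as segmentsToLoad / segmentsToDrop
    print(server, queue)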

Hmm. We looked at http://COORDINATOR:8081/druid/coordinator/v1/datasources/DATASOURCE/intervals/2017-10-08T21:00:00.000Z_2019-12-18T22:00:00.000Z?full (where that interval covers our entire load range and then some), and every returned segment other than the ones currently being created has 2 historicals (our specified replication factor) listed under ‘servers’. Is this surprising?

Is there a way to figure out what segments are unavailable?

You could do this by:

1.) In your metadata store, query the druid_segments table for segments where used = true and dataSource = <your datasource> (these are the segments that should be loaded on historicals)

2.) Issue a segment metadata query (http://druid.io/docs/latest/querying/segmentmetadataquery.html) for the datasource in question, across an interval that covers all of the data for that datasource (these are segments that are actually loaded on historicals)

The segments that appear in the first query but not in the second query are the unavailable segments.
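If it helps, here’s a rough (untested) sketch of that diff in Python. The MySQL metadata-store connection details, the broker at BROKER:8082, and the DATASOURCE name are all placeholders for your setup; use psycopg2 or similar instead if your metadata store is PostgreSQL.

# Rough sketch of the two-step diff described above.
# Placeholders/assumptions: MySQL metadata store, broker at BROKER:8082,
# DATASOURCE as the datasource in question.
import pymysql
import requests

DATASOURCE = "DATASOURCE"

# 1.) Segments that *should* be loaded, from the metadata store.
conn = pymysql.connect(host="METADATA_HOST", user="druid",
                       password="PASSWORD", database="druid")
with conn.cursor() as cur:
    cur.execute(
        "SELECT id FROM druid_segments WHERE used = true AND dataSource = %s",
        (DATASOURCE,))
    should_be_loaded = {row[0] for row in cur.fetchall()}

# 2.) Segments that actually *are* loaded, via a segmentMetadata query to the broker.
query = {
    "queryType": "segmentMetadata",
    "dataSource": DATASOURCE,
    "intervals": ["1000-01-01/3000-01-01"],  # wide enough to cover all data
    "merge": False,
    "analysisTypes": [],  # we only need segment ids, so skip the column analysis
}
resp = requests.post("http://BROKER:8082/druid/v2", json=query)
resp.raise_for_status()
actually_loaded = {segment["id"] for segment in resp.json()}

# The difference is the set of unavailable segments.
for segment_id in sorted(should_be_loaded - actually_loaded):
    print(segment_id)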

From there, I would search the historical logs for errors related to those unavailable segments.

Thanks, this looks like it should work.

(We’re running a prerelease a bit before 0.13.0 that doesn’t quite have the sys table yet. Funny, I think I met the developer of that feature at the Lyft meetup, and when she described it to me it seemed neat in theory but not something I’d actually want to use… nope :slight_smile: )

Looks like these segments are missing from deep storage, eek! And we don’t have GCS versioning enabled on that bucket… fixing that now. And hmm, now that I know what to look for, there are relevant errors in the historical logs.

I think we can use the coordinator disable API (DELETE /druid/coordinator/v1/datasources/{dataSourceName}/segments/{segmentId}) to just ignore these segments (or, equivalently, just set used to false in the metadata store?).
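(Roughly this, I think — just a sketch; SEGMENT_IDS stands in for the list we got from the diff above, and the coordinator address is the same one as earlier. The segment ids may need URL-encoding depending on the characters in them.)

# Sketch: disable the segments that turned out to be missing from deep storage,
# using the coordinator disable API quoted above.
import requests
from urllib.parse import quote

SEGMENT_IDS = []  # fill in with the unavailable segment ids found earlier

for segment_id in SEGMENT_IDS:
    url = ("http://COORDINATOR:8081/druid/coordinator/v1/datasources/"
           "DATASOURCE/segments/" + quote(segment_id, safe=""))
    requests.delete(url).raise_for_status()
    print("disabled", segment_id)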

–dave

That was really helpful, Jon!

I think this insight might be helpful for my question at https://groups.google.com/d/msg/druid-user/OVJdDc9dGp8/1U4gL_laDgAJ

Specifically, it seems like if we ran our own monitoring task (ignoring Druid’s built-in metrics for the moment) that runs the SQL query:

SELECT datasource, COUNT(segment_id) AS unavailable FROM sys.segments WHERE is_available = 0 AND is_published = 1 GROUP BY datasource

then we would be able to publish a metric of “segments which are truly unavailable”.
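(i.e., something like this sketch, where BROKER:8082 is a stand-in for our broker’s SQL endpoint and push_metric() is a stand-in for whatever metrics client we end up using:)

# Sketch of the monitoring task: run the SQL above against the broker's
# /druid/v2/sql endpoint and publish one gauge per datasource.
# BROKER:8082 and push_metric() are placeholders.
import requests

SQL = """
SELECT datasource, COUNT(segment_id) AS unavailable
FROM sys.segments
WHERE is_available = 0 AND is_published = 1
GROUP BY datasource
"""

def push_metric(name, value, tags):
    # stand-in for statsd / Stackdriver / whatever we actually use
    print(name, value, tags)

resp = requests.post("http://BROKER:8082/druid/v2/sql", json={"query": SQL})
resp.raise_for_status()
for row in resp.json():  # default result format: one JSON object per row
    push_metric("segments/truly_unavailable", row["unavailable"],
                {"datasource": row["datasource"]})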

But now I’m curious what makes you say that ‘the “is_available” handling is bugged in the current 0.13.0-incubating release and always returns 1’. (We’re on 0.13.0-incubating now :slight_smile: )

–dave

OK, after futzing around for a while, I think the trick might be

SELECT COUNT(*) AS unavailable FROM sys.segments WHERE num_replicas = 0

Does that seem right?