Unavailable Segments after PostgreSQL and ZooKeeper restarts

Relates to Apache Druid 0.21.1

Last week, we updated Superset, which runs in the same Docker stack as Druid 0.21.1. Updating Superset triggered restarts of the (single-replica) ZooKeeper and PostgreSQL services.
The next day, the customer noticed that new data wasn’t available, and we found that 116 segments (1 per day) were not available.
In the Coordinator logs, we found the following entries:

2021-09-10T10:09:08,121 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.duty.LogUsedSegments - Found [791] used segments.
2021-09-10T10:09:08,121 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.ReplicationThrottler - [_default_tier]: Replicant create queue is empty.
2021-09-10T10:09:08,122 ERROR [Coordinator-Exec--0] org.apache.druid.server.coordinator.duty.RunRules - Unable to find matching rules!: {class=org.apache.druid.server.coordinator.duty.RunRules, segmentsWithMissingRulesCount=675, segmentsWithMissingRules=[<10 segments>]}

and

2021-09-10T10:11:58,129 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.duty.BalanceSegments - Found 1 active servers, 0 decommissioning servers
2021-09-10T10:11:58,129 WARN [Coordinator-Exec--0] org.apache.druid.server.coordinator.duty.BalanceSegments - [_default_tier]: insufficient active servers. Cannot balance.

I increased historical replication to 2 nodes, but it did not solve the issue.
In the end, I restarted the whole stack, following Druid segments become unavailable after data ingestion - #3 by Shashank_NS. I also started the ingestion manually (it normally runs automatically every 10 minutes). After a few minutes, the problem was resolved and no segments were unavailable anymore.

Any idea what caused the segment unavailability? I fail to see how the Superset update (1.2.0 -> 1.3.0, with server-side pagination now enabled, i.e. data is queried from Druid in a leaner way) could have caused unavailable segments. My working hypothesis is therefore that it has something to do with the PostgreSQL and ZooKeeper restarts. Might they have caused some sort of hiccup?

We now have 675 segments available, which seems to be the correct number. All data seems to be available. I don’t know where the 791 (see logs) came from.

How many coordinators do you have, may I ask? Did you see any errors in the coordinator logs? We recently saw a similar (I think) issue that also involved coordinator errors, and we are still looking into it and how to fix it. In that case, it had to do with a coordinator failing to come up as master, then being reassigned as master, and the code around that.

Hello. I’m late to the party. Here is cake.

AFAIK the number of used segments reported there comes from a simple SQL query on the metadata DB, counting the segments whose used flag is true. That would seem to indicate that, at that moment, the coordinator did manage to query the metadata DB and got that result back.
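If it helps, something along these lines against the metadata store should reproduce that count (a minimal sketch, assuming a PostgreSQL metadata DB with the default druid_ table prefix; the connection details are placeholders):

import psycopg2  # PostgreSQL driver, assuming the metadata store is PostgreSQL

# Placeholder connection details; point these at your metadata DB.
conn = psycopg2.connect(host="localhost", dbname="druid", user="druid", password="druid")
with conn, conn.cursor() as cur:
    # Count segments flagged as used, per datasource (default table name: druid_segments).
    cur.execute(
        "SELECT datasource, COUNT(*) FROM druid_segments WHERE used = true GROUP BY datasource"
    )
    for datasource, count in cur.fetchall():
        print(datasource, count)
conn.close()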

The missing rules error would indicate that, at that time, the coordinator thought the time periods of those segments in that datasource were not covered by any load rule. The rules are stored in the metadata database, but I’m not sure when they’re refreshed: maybe the coordinator couldn’t get the rules, @Ben_Krug?
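One way to sanity-check is to ask the Coordinator directly which rules it currently has (a sketch using the standard rules endpoint, assuming the Coordinator is reachable at localhost:8081):

import json
import urllib.request

COORDINATOR = "http://localhost:8081"  # placeholder Coordinator address

# All load/drop rules, keyed by datasource; cluster-wide defaults appear under "_default".
with urllib.request.urlopen(COORDINATOR + "/druid/coordinator/v1/rules") as resp:
    rules = json.load(resp)

print(json.dumps(rules, indent=2))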

Re: the cannot-balance warning, it looks like you had just one Historical server. I would have expected increasing the replication to 2 to confuse Druid further, because you would need 2 servers for that replication to be satisfied.
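For context, the replication factor lives in the load rule itself, per tier, so a rule like the sketch below (hypothetical datasource name, placeholder Coordinator address) asks for 2 replicants in _default_tier and can only be fully satisfied once there are at least 2 Historicals in that tier:

import json
import urllib.request

COORDINATOR = "http://localhost:8081"  # placeholder Coordinator address
DATASOURCE = "my_datasource"           # hypothetical datasource name

# loadForever rule requesting 2 replicas in _default_tier; with a single
# Historical in the tier, the second replica can never be assigned.
rules = [{"type": "loadForever", "tieredReplicants": {"_default_tier": 2}}]

req = urllib.request.Request(
    COORDINATOR + "/druid/coordinator/v1/rules/" + DATASOURCE,
    data=json.dumps(rules).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
urllib.request.urlopen(req)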

Just following on from Ben as well: in production, don’t forget you ought to have 3+ Master server nodes to allow for rolling upgrades (for example), and your metadata database service should similarly be resilient enough to allow for upgrades. I mean, we’re not talking i3.16xlarge machines or anything… it’s more about having a good number of processes.

And now, I shall go back to my cake.
