Overlord: "The leader threw an exception" related to Apache Curator

Hello all.
My team runs a Druid cluster in production, and we've had several Overlord outages in the past 2 months. The Overlord process stays alive but becomes unresponsive, and we have to kill -9 it.

Looking back through our logs, I see a pattern: every one of the outages includes these two lines:

2016-09-09T22:39:48,546 ERROR [Curator-LeaderSelector-0] org.apache.curator.framework.recipes.leader.LeaderSelector - The leader threw an exception

java.lang.IllegalMonitorStateException: You do not own the lock: /druid/indexer/leaderLatchPath

Digging a bit deeper into Curator's LeaderSelector class, I found a recently resolved Curator defect on Apache's issue tracker:

https://issues.apache.org/jira/browse/CURATOR-337

The fix was applied on July 28, 2016, and is included in Curator versions 3.2.1 and 2.11.1.

I checked our Druid version (0.9.1.1), and it uses Curator 2.10.0.
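To make the version reasoning explicit: the fix landed in 2.11.1 on the 2.x line and 3.2.1 on the 3.x line, so 2.10.0 predates it. Below is a rough sketch of that check. The class and method names (`Curator337Check`, `includesFix`) are hypothetical and not part of Druid or Curator. It only handles plain numeric `MAJOR.MINOR.PATCH` strings and only the 2.x and 3.x lines, since those are the only ones the ticket information above covers.

```java
// Hypothetical helper for illustration only; not part of Druid or Curator.
public final class Curator337Check {

    private Curator337Check() {}

    // True if the version is at or above the first release said to contain the
    // CURATOR-337 fix: 2.11.1 on the 2.x line, 3.2.1 on the 3.x line.
    public static boolean includesFix(String version) {
        int[] v = parse(version);
        int[] firstFixed;
        switch (v[0]) {
            case 2: firstFixed = new int[] {2, 11, 1}; break;
            case 3: firstFixed = new int[] {3, 2, 1}; break;
            default:
                // Other major lines are outside what the ticket info above covers.
                throw new IllegalArgumentException(
                    "Only the 2.x and 3.x lines are covered: " + version);
        }
        return compare(v, firstFixed) >= 0;
    }

    // Parses "MAJOR.MINOR.PATCH"; a missing minor or patch counts as 0.
    // Qualifiers such as "-SNAPSHOT" are not supported.
    private static int[] parse(String version) {
        String[] parts = version.trim().split("\\.");
        int[] out = new int[3];
        for (int i = 0; i < 3 && i < parts.length; i++) {
            out[i] = Integer.parseInt(parts[i]);
        }
        return out;
    }

    // Compares [major, minor, patch] component by component.
    private static int compare(int[] a, int[] b) {
        for (int i = 0; i < 3; i++) {
            if (a[i] != b[i]) {
                return Integer.compare(a[i], b[i]);
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        // 2.10.0 is the Curator version bundled with Druid 0.9.1.1, per the post.
        System.out.println("2.10.0 -> " + includesFix("2.10.0"));
        System.out.println("2.11.1 -> " + includesFix("2.11.1"));
        System.out.println("3.2.1  -> " + includesFix("3.2.1"));
    }
}
```

Running `main` should print `false` for 2.10.0 and `true` for the two fixed releases.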

Has anyone else experienced this error?

Any advice on upgrading Curator?

Thanks–
Chris Freyer

Hi Chris, when you say outage, can you describe in more detail what happens? Do you have logs from the Overlord during these outages? I wonder if the problem is something else and the Overlord error messages are a symptom rather than the cause.