Failed to read beam data from cache?

We’ve seen a few issues with ZooKeeper stability putting Druid into a bad state.

One of them shows up on the Tranquility side and stems from this line:

https://github.com/druid-io/tranquility/blob/master/core/src/main/scala/com/metamx/tranquility/beam/ClusteredBeam.scala#L153

What happens when Curator can’t get data back from ZK?

It appears to result in a JsonMappingException that isn’t surfaced to the app layer, and from which Tranquility can’t recover.

Any ideas how we could recover from that exception?

-brian

Hey Brian,

When exactly does this happen? Are your znodes getting removed or corrupted somehow?

If your znodes are disappearing or getting corrupted, you should be able to recover by stopping all your Tranquility senders, clearing out the metadata in ZK (by default in /tranquility), and then restarting them. But this should really NOT be something you do in response to ordinary connectivity problems. A properly configured ZK cluster should be immune to corruption even in the face of downtime and crashes. So if this is happening for routine issues, I would double-check your ZK configuration to make sure the quorum is set up properly.
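
For reference, the "clear out the metadata" step could look roughly like the sketch below, using Curator directly. This is only an illustration, not something that ships with Tranquility: the object name, the default connect string "localhost:2181", and the timeout and retry values are placeholders, and it assumes the metadata lives under the default /tranquility path. Only run something like this after every sender has been stopped.

```scala
import java.util.concurrent.TimeUnit

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry

// Hypothetical one-off maintenance tool; not part of Tranquility.
object ClearTranquilityMetadata {
  def main(args: Array[String]): Unit = {
    // Assumed defaults; pass your own connect string and base path as arguments.
    val zkConnect = if (args.length > 0) args(0) else "localhost:2181"
    val basePath  = if (args.length > 1) args(1) else "/tranquility"

    // Retry up to 3 times, starting with a 1 second backoff (placeholder values).
    val client = CuratorFrameworkFactory.newClient(zkConnect, new ExponentialBackoffRetry(1000, 3))
    client.start()
    try {
      // Give up if no connection is established within 30 seconds.
      if (!client.blockUntilConnected(30, TimeUnit.SECONDS)) {
        sys.error(s"Could not connect to ZooKeeper at $zkConnect")
      }
      // checkExists returns null when the znode is absent.
      if (client.checkExists().forPath(basePath) != null) {
        // Recursively delete the base znode and everything under it.
        client.delete().deletingChildrenIfNeeded().forPath(basePath)
        println(s"Deleted $basePath")
      } else {
        println(s"Nothing to delete at $basePath")
      }
    } finally {
      client.close()
    }
  }
}
```

Once the path is gone, restart the senders so they start from clean metadata.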

Thanks Gian.

Agreed. This should not be a routine operation. =)

We saw some dependency funkiness around Jackson, which we “fixed” by aligning the dependency versions as much as possible.

I’m hoping that might be it.

-brian