MiddleManager abandons segment after failed handoff

Hey Druids

As you may well know, there was an AWS S3 outage of somewhat unprecedented scale today.

We use S3 for deep storage, and our MMs attempted to hand off segments during the outage. The logs show a bunch of retries in the S3 client, finally followed by a failure. At that point, it looks like the RealtimePlumber calls “abandonSegment”.

I was somewhat surprised that, as all those tasks died, Druid went ahead and “helpfully” rm’d all of the segments, leaving no recovery path (manual or otherwise). This seems a little aggressive. Is there a configuration we may have missed that would enable any of the following scenarios?

  • keep retrying in the S3 client for a much longer window

  • keep the realtime task running until the handoff succeeds

  • leave the segment on disk for manual handoff, fixup, etc.

Though this scenario is pretty rare, we’d like to handle it more gracefully. Thanks for any direction.

Max

Hey Max,

There’s no config for any of those today, although they all sound like things that would make good patches if you wanted to take them up. The first would probably be an S3 deep storage config, the second something in the tuning config of a realtime task (like “don’t abandon the segment on handoff failure”), and the last something in the middleManager config (like “don’t delete task files when a task fails”).
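Purely as a sketch of where those knobs might live (to be clear, none of these exist today, and the property names below are invented for illustration), a patch could end up adding something along these lines to runtime.properties:

  # Hypothetical only: none of these properties exist in Druid today.
  # (1) S3 deep storage: retry segment pushes over a much longer window before failing
  druid.storage.pushRetryCount=60
  druid.storage.pushRetryMaxDelayMillis=60000
  # (3) middleManager: keep a failed task's working dir (including any built but
  #     unpushed segments) on disk for manual handoff or inspection
  druid.indexer.task.keepTaskFilesOnFailure=true

The second one would more naturally live in the realtime task’s tuningConfig, e.g. a hypothetical “abandonSegmentOnPushFailure”: false flag.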

Fwiw, the thinking behind the current behavior is that classic realtime tasks are meant to make sense as part of a lambda architecture, and so you don’t want them trying “too hard”. They should generally give up promptly so they can relinquish their lock and let a batch job take over. If you don’t have a batch job to take over, this makes less sense.

Also consider looking at the Kafka indexing service, which, unlike classic realtime tasks, does try hard to get all your data in. It tries hard enough that it is able to guarantee that all of your Kafka data is loaded, exactly once, as long as you get it into Druid before Kafka deletes it due to your topic’s retention policy. It’s designed that way so you can run it by itself without any lambda-style batch jobs to pick up things it drops on the floor.
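If you want to kick the tires on it, the gist is that you run a supervisor on the overlord, and it manages the Kafka indexing tasks for you. A minimal supervisor spec looks roughly like the following (datasource, topic, and broker address are placeholders, and I’m leaving out most optional fields); you POST it to the overlord at /druid/indexer/v1/supervisor:

  {
    "type": "kafka",
    "dataSchema": {
      "dataSource": "metrics-kafka",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "timestamp", "format": "auto" },
          "dimensionsSpec": { "dimensions": [] }
        }
      },
      "metricsSpec": [ { "type": "count", "name": "count" } ],
      "granularitySpec": { "type": "uniform", "segmentGranularity": "HOUR", "queryGranularity": "NONE" }
    },
    "tuningConfig": { "type": "kafka", "maxRowsPerSegment": 5000000 },
    "ioConfig": {
      "topic": "metrics",
      "consumerProperties": { "bootstrap.servers": "kafka01.example.com:9092" },
      "taskCount": 1,
      "replicas": 1,
      "taskDuration": "PT1H"
    }
  }

The supervisor keeps spinning up replacement tasks as needed, and the tasks record the Kafka offsets they’ve read alongside the segments they publish, which is where the exactly-once guarantee comes from.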

Gian

Thanks for the feedback. The path to implementing all of those potential tweaks makes sense.

My thinking is that if you’re looking at overall time to recovery from a failure like this, it’s always going to be faster to keep trying to ship the segment you’ve already built than to replay hours of data from Kafka or run a large batch job.

This could be a pathological case, but if your batch layer is also S3/EMR, then I question whether that’s actually a reasonable backup strategy in the case of an S3 outage. Amiright?

I can see a case being made both ways, I suppose. You make a good argument for realtime tasks trying harder to finish than they currently do. But also, if they try too hard, you run into situations where poorly configured or broken tasks can hold up processing indefinitely through retries that are never going to accomplish anything (maybe the hardware is bad, the disk is full, S3 access isn’t configured right, etc.). At any rate, the defaults for Tranquility-style setups tilt towards giving up after “reasonable” effort and assume there is some backup batch processing layer to get the data loaded eventually. More options to let users tilt the other direction would be great.

I think relying on batch backup still makes sense even in the face of S3 outages, since the data would still be available in Kafka for replay, up to your retention period. To avoid data loss, all you have to do is make sure Kafka stays up and keeps accepting data, and make sure your consumers are recovered before the retention period elapses. In AWS you can run a resilient Kafka deployment across multiple AZs or even regions (depending on your fault tolerance vs. complexity tradeoff).
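Concretely, the recovery window is just your Kafka retention setting, so it’s worth sizing that with this kind of outage in mind. For example (values are only illustrative):

  # Broker-wide default (server.properties): keep data for 7 days
  log.retention.hours=168

  # Or a per-topic override (set via kafka-configs.sh / kafka-topics.sh): 7 days in ms
  retention.ms=604800000

The tradeoff is basically broker disk vs. how long an outage you can ride out before data starts aging off.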

And if you don’t have batch backup, or don’t want to rely on it, then the Kafka indexing service (http://druid.io/docs/latest/development/extensions-core/kafka-ingestion.html) is designed to be reliable on its own, with no batch layer behind it.