Who wins the repl race?

When you run with real-time replication (e.g. a replication factor of 3), how does Druid determine which replica wins the race to deep storage?

I ask because we had a task stuck in pending for a few minutes. It eventually got a slot, but it was missing those first few minutes of data. When the broker sent queries to that task, Pivot showed a noticeable gap. We now kill any duplicate tasks that get stuck in pending, but I’d like to understand better what happens during segment persist on real-time tasks.

(Pointers to code work too =)




The first one submitted for handoff should win the race to deep storage.

The case you hit is an interesting one, because Druid treats all replicas as identical even though they may contain slightly different data.
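Roughly, the race can be pictured like this (a minimal sketch, not Druid's actual code; `publish`, the segment id, and the in-memory "metadata store" below are all stand-ins):

```python
# Hypothetical sketch: replicas of the same task all build a segment with the
# SAME segment identifier. Whichever replica publishes first "wins"; later
# publishes of that id are no-ops, even if the losing replica happened to
# index slightly more data.

metadata_store = {}  # segment_id -> payload; stands in for the metadata DB

def publish(segment_id, payload):
    """Atomic insert-if-absent, like a unique-key INSERT on the segments table."""
    if segment_id in metadata_store:
        return False  # another replica already published this segment id
    metadata_store[segment_id] = payload
    return True

# Two replicas of the same interval race to publish:
seg_id = "events_2016-01-01T00:00_2016-01-01T01:00_v1"
won_a = publish(seg_id, {"rows": 1000})  # replica A, complete data
won_b = publish(seg_id, {"rows": 940})   # replica B, started late, fewer rows

print(won_a, won_b)            # True False
print(metadata_store[seg_id])  # whichever replica published first sticks
```

This is why a replica that spent its first few minutes stuck in pending can still win the race and leave a gap in the persisted segment.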

Why was the task stuck in pending? Was it a lack of slots available?

We had slots available, but we had spun up a bunch of new MiddleManagers to increase capacity. From what we’ve seen, if you adjust MiddleManagers (add nodes, kill nodes, etc.), sometimes the overlord and coordinator don’t pick up on the changes without a restart, which can leave tasks in pending. We double-checked availability groups, etc.
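For spotting this, the overlord exposes its view of pending tasks over HTTP (`GET /druid/indexer/v1/pendingTasks`). Here is a small sketch of flagging tasks that have sat in pending too long, assuming a response shaped like the sample below (the field names and the 5-minute threshold are illustrative; check your Druid version's API):

```python
from datetime import datetime, timedelta, timezone

# Sample of what GET http://overlord:8090/druid/indexer/v1/pendingTasks might
# return (shape is illustrative, not a guaranteed contract).
pending = [
    {"id": "index_realtime_events_0", "createdTime": "2016-03-01T10:00:00.000Z"},
    {"id": "index_realtime_events_1", "createdTime": "2016-03-01T10:09:30.000Z"},
]

def stuck_tasks(tasks, now, max_pending=timedelta(minutes=5)):
    """Return ids of tasks that have been pending longer than max_pending."""
    out = []
    for t in tasks:
        created = datetime.strptime(t["createdTime"], "%Y-%m-%dT%H:%M:%S.%fZ")
        created = created.replace(tzinfo=timezone.utc)
        if now - created > max_pending:
            out.append(t["id"])
    return out

now = datetime(2016, 3, 1, 10, 10, tzinfo=timezone.utc)
print(stuck_tasks(pending, now))  # ['index_realtime_events_0']
```

Something like this could drive the "kill duplicate tasks stuck in pending" workaround automatically instead of by hand.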

Future feature request:

Introduce strategies for deep storage persistence across replicated tasks. =)

I think most people would want the most complete replica to win.

We run replicas in separate availability zones, so a network partition might cause one replica to have less data.
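A "most complete replica wins" strategy could look something like this (entirely hypothetical; this is not a Druid feature, and the names below are made up for illustration). Instead of first-to-publish, the publisher would compare replica stats and keep the replica that indexed the most rows:

```python
# Hypothetical selection strategy: prefer the replica that indexed the most
# rows, e.g. when replicas in different availability zones saw a partition.

def pick_most_complete(replicas):
    """replicas: list of dicts with 'task_id' and 'rows_indexed'."""
    return max(replicas, key=lambda r: r["rows_indexed"])

replicas = [
    {"task_id": "replica_us-east-1a", "rows_indexed": 10000},
    {"task_id": "replica_us-east-1b", "rows_indexed": 9400},  # briefly partitioned
    {"task_id": "replica_us-east-1c", "rows_indexed": 10000},
]
print(pick_most_complete(replicas)["task_id"])  # replica_us-east-1a
```

One wrinkle with any such strategy: it forces the publish step to wait for (or time out on) all replicas before choosing, whereas first-to-publish needs no coordination.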


Hey Brian,

The overlord really should be picking things up without being restarted. There are a few large AWS-based Druid clusters that regularly launch and terminate MiddleManagers without much fuss, so I’m curious whether something in your deployment makes problems more likely to happen.

Do you have a lot of network partitions? ZK outages? Anything like that?

If you manage to catch this happening again, it’d be great if you could file an issue here with whatever details you have available (description of the outage, screenshots of the overlord, logs): https://github.com/druid-io/druid/issues

ZK seems to be fairly stable.

I’ll absolutely try to catch this next time it happens.

(We typically dump the state from ZK and have a look.)

Curious: are those larger deployments using auto-scaling?


Hi Brian, yes, we do use EC2 auto-scaling here at Metamarkets.

Cool. We are still scaling manually, but plan to move to auto.

I wonder if that could be part of the issue.

We’ll keep an eye out and let you know what we uncover.


Depending on what version you are running, you could be hitting

That was causing pending tasks on our end at one point.