High Availability Druid cluster requires a minimum of 12 machines


We are evaluating Druid as an analytics store but the devops story seems quite complicated. Am I reading this right that a high availability cluster requires at least 12 machines?

Overlord x2

MiddleManager x2

Coordinator x2

Historical x2

Broker x2

Real-Time x2


Are most people combining nodes onto single machines, or deploying all 12 and scaling/tuning from there?




– Himanshu


Overlord/Coordinator x2

MiddleManager/Historical x2

Broker x2

You might also want to read about HA Druid here: http://imply.io/docs/latest/cluster.html

Hello, I was wondering if you could go into more detail on how to achieve an HA Druid cluster. Currently I have a Coordinator, Broker, Historical, Realtime, and Overlord in my setup, and so far I only have one of each. To achieve an HA environment, would I just add a copy of each node and let them communicate with ZooKeeper, and that's it? With this in mind, is there a test tool so that while we are testing our Druid cluster's performance we can also test its HA?

We have some new clustering docs available here that should be useful: http://druid.io/docs/0.9.0/tutorials/cluster.html

To get HA you need 2x Coordinator, 2x Overlord, 2x+ Broker, 2x+ Historical, and 2x+ MiddleManager. You don’t get much benefit from adding more Coordinators and Overlords (they are failover-based HA) but you do get scaling benefit from adding more Brokers, Historicals, and MiddleManagers.

This doesn’t mean you need 10 machines. Especially for smaller clusters it is very common to colocate Druid services on the same physical machines. You could in theory get by with 2 physical machines although most people do 4–6 for a basic cluster (separating data-heavy services from coordination services).
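To make the colocation idea concrete, a hypothetical 4-machine layout along those lines (machine roles are illustrative, not a prescribed topology) might be:

```
Machine 1 (coordination): Coordinator + Overlord
Machine 2 (coordination): Coordinator + Overlord
Machine 3 (data):         Historical + MiddleManager + Broker
Machine 4 (data):         Historical + MiddleManager + Broker
```

This keeps the failover-based services (Coordinator/Overlord) on lightweight machines and puts the horizontally scalable services (Historical/MiddleManager/Broker) on the data-heavy ones.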

Also, you generally don't need both Overlord and Realtime nodes; you pick one or the other (we recommend Overlords these days).

Hi Gian,

Thank you very much for replying to my comment so quickly. But what about the configuration to attain HA? Do we just have them connect to ZooKeeper, or are there any configurations we would need to do? And is there a script we can use to test the failover transition in the cluster?



Just FYI, our method of testing failover on our cluster is to do normal operations like rolling restarts for upgrades or configuration changes, and we exercise it regularly.
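If you do want something scripted, one rough approach (a sketch, not an official Druid tool; the host, port, and poll count are placeholders) is a small shell loop that polls a node's `/status` endpoint while you restart one of a redundant pair, so you can watch whether availability holds:

```shell
#!/bin/sh
# Rough failover probe (a sketch, not an official Druid tool).
# Repeatedly hits a node's /status endpoint and prints UP or DOWN,
# so you can watch availability while bouncing one of a redundant pair.
probe() {
  host="$1"    # e.g. coordinator-host:8081 (placeholder)
  tries="$2"   # how many polls to run
  i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -sf --max-time 2 "http://$host/status" >/dev/null 2>&1; then
      echo "UP"
    else
      echo "DOWN"
    fi
    i=$((i + 1))
    sleep 1
  done
}

# Example: run `probe coordinator-host:8081 60` while you stop one
# Coordinator; the output should stay UP if failover is working.
```

Pointing the loop at the Broker while killing a Coordinator (as described above) is one way to observe the transition without a special-purpose test harness.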

Hey Shim,

The Druid nodes are all automatically HA, assuming you have configured them to use an external metadata store and ZooKeeper cluster (rather than the default embedded metadata option). All you have to do is start up multiple instances of each.

It is up to you to make your metadata store and ZooKeeper cluster HA. The usual way of doing that is using MySQL/PostgreSQL with replication and failover, and setting up a 3- or 5-node ZooKeeper cluster.
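For reference, the configuration amounts to a few shared runtime properties (host names and credentials below are placeholders) in each node's common properties file, something like:

```properties
# Point every node at the shared ZooKeeper ensemble
druid.zk.service.host=zk1:2181,zk2:2181,zk3:2181

# Use an external, replicated metadata store instead of the embedded default
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://mysql-host:3306/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=diurd
```

With these in place, starting a second Coordinator (or Overlord) makes it discover the existing leader through ZooKeeper and wait as a standby.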

Hi everyone,

I am sorry for the late reply. I have set everything up, but I was wondering if there is a test script to test the failover besides upgrading each node one by one? I was hoping for something I could leave running on the Broker node that would test its communication with the Coordinator, while I shut down one of the Coordinators to test the failover. I'm sorry if I have a lot of questions.



In addition to my earlier post, I currently have 5 node types deployed in my cluster, namely Overlord, Broker, Historical, Realtime, and Coordinator. Currently we only have one of each and are planning to set up an HA environment. I am aware that I could run the Coordinator and Overlord on the same nodes, which reduces my HA cluster to 8. Are there any other nodes that I can merge to minimize the number of servers deployed?

You should take a look at how Imply chose to package Druid if you want to use less hardware: