High Availability Questions

Hi All,

We are setting up Druid for High Availability and have some questions about redundancy and failover.

Currently our Druid cluster consists of the following:

  • 1 Master Node running Zookeeper/Overlord/Coordinator

  • 2 Data Nodes running Middle Manager and Tranquility

  • 1 Historical Node running the Historical process

  • 1 Query Node running the Broker process

So far, we have done or plan to do the following:

  • Tranquility: We have set “task.replicants” to “2” in the Tranquility configuration. This ensures that even if one task or machine goes down, another task on a separate worker machine will stay alive and continue to capture events.

  • Placed Tranquility behind a Load Balancer to achieve increased throughput and failover if one machine goes down.

  • We will add another query node and put them behind a load balancer. This should divide the load between the 2 machines and will also ensure that if one goes down, the other can continue serving query requests.

  • We will add another Master Node running Zookeeper/Overlord/Coordinator and set “druid.zk.service.host” property to the 2 machines like this: zk-host1:port,zk-host2:port

Should we use the same string for the “zookeeper.connect” property in Tranquility, or do we need to put the ZooKeeper machines behind a load balancer to have a common address?

Based on these, we have a few questions we would like your expertise on:

  1. Are there any other properties that we need to configure on the overlord/coordinator to achieve High Availability?

  2. Is it enough to have 2 ZooKeeper instances, or must we have 3? Do we need any other configuration?

  3. How can we enable replication on the historical nodes? I read in the forums that the default replication factor is 2. If this is true, do we have to make any changes to the rules through the coordinator UI? Also, how can we check that replication works?

  4. Since we already have 2 Data Nodes (Middle Manager and Tranquility), do we need to add another 2 to ensure High Availability, or are the 2 existing machines enough? I assume the two Tranquility “replicant” tasks are never both assigned to the same machine. Following that logic, is it enough to add just 1 machine, so that even if 1 of the 3 goes down, Tranquility can still assign the tasks to the remaining 2?

Thank you in advance for your help.

Kind regards,
Petros

Having 2 ZK nodes is no better than having 1: ZooKeeper needs a majority of the ensemble to be up, so with 2 nodes you will still have an outage if either one goes down. With 3 ZK nodes you can continue to function if 1 goes down.
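
The arithmetic behind this is ZooKeeper's majority rule: the ensemble stays available only while more than half of its servers are reachable. A minimal Python sketch of that rule (the function names are made up for illustration, they are not part of any ZooKeeper or Druid API):

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    # Compare common ensemble sizes side by side.
    for n in (1, 2, 3, 5):
        print(f"{n} ZK servers: quorum={quorum_size(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

With 2 servers the tolerated failure count is 0, the same as with 1; with 3 servers it is 1.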

Hey Petros,

As Pushkar mentioned, you will need 3 ZK nodes to get redundancy. You should have 3 data nodes too: if you only have two, then while one is down new task sets won’t be able to start up, because both replicants need to be started and they have to run on different workers. Other than that, you will also want to take a look at your metadata store. If this is MySQL or PostgreSQL then you can set up replication and failover.
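
To make the data node count concrete, here is a rough Python sketch of the reasoning. It assumes that every replica of a task has to land on a different worker and that each live worker has a free task slot; the function names are hypothetical and only illustrate the counting.

```python
def can_start_task_set(live_workers: int, replicants: int) -> bool:
    """A new task set needs one distinct live worker per replica (assumption)."""
    return live_workers >= replicants


def survives_one_failure(total_workers: int, replicants: int) -> bool:
    """Can new task sets still start after losing any single worker?"""
    return can_start_task_set(total_workers - 1, replicants)


if __name__ == "__main__":
    replicants = 2  # matches the replicant count discussed in this thread
    for total in (2, 3):
        ok = survives_one_failure(total, replicants)
        print(f"{total} data nodes, {replicants} replicants: "
              f"new task sets can start with one node down = {ok}")
```

Under those assumptions, 2 data nodes cannot start a new 2-replica task set while one node is down, but 3 data nodes can.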

You can double check that Druid’s replication is working by using the coordinator/overlord web consoles and APIs. The coordinator will tell you which segments are loaded on which historical nodes, and you can confirm that each segment is loaded twice on two different nodes. The overlord will tell you which workers are running which tasks, and you can confirm that replica tasks exist and are placed on different workers (the partition and replica numbers are the last _X_Y in the task id).
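
If you want to automate those two checks, something along these lines might work. This is only a sketch: it assumes you have already collected the task-to-worker and segment-to-server mappings yourself (from the consoles or APIs), and the task ids, worker names, and segment names in the example are made up. It does not call any Druid API.

```python
from collections import defaultdict


def colocated_replicas(task_to_worker: dict) -> list:
    """Return task groups whose replicas share a worker.

    The group key is the task id without its trailing replica number,
    i.e. everything before the final "_Y" of the "_X_Y" suffix.
    """
    workers_by_group = defaultdict(list)
    for task_id, worker in task_to_worker.items():
        group, _replica = task_id.rsplit("_", 1)
        workers_by_group[group].append(worker)
    return sorted(group for group, workers in workers_by_group.items()
                  if len(set(workers)) < len(workers))


def under_replicated_segments(segment_to_servers: dict, wanted: int = 2) -> list:
    """Return segments loaded on fewer than `wanted` distinct servers."""
    return sorted(segment for segment, servers in segment_to_servers.items()
                  if len(set(servers)) < wanted)


if __name__ == "__main__":
    # Hypothetical data, typed in by hand for illustration.
    tasks = {
        "index_realtime_events_2016-01-01T00:00:00.000Z_0_0": "worker-a",
        "index_realtime_events_2016-01-01T00:00:00.000Z_0_1": "worker-b",
        "index_realtime_events_2016-01-01T00:00:00.000Z_1_0": "worker-a",
        "index_realtime_events_2016-01-01T00:00:00.000Z_1_1": "worker-a",
    }
    segments = {"segment-1": ["hist-a", "hist-b"], "segment-2": ["hist-a"]}
    print("Replicas sharing a worker:", colocated_replicas(tasks))
    print("Under-replicated segments:", under_replicated_segments(segments))
```

In the made-up data, partition 1 has both replicas on worker-a and segment-2 is loaded on only one historical, so both would be flagged.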

Hello All,

Has anybody tried High Availability? If so, could you please explain the architecture and configuration settings you used to achieve it?