Druid Cluster deployment does not work at all

Hi guys, I need your help deploying Apache Druid Cluster.

I have 4 AWS EC2 instances:
m5.2xlarge for query server
i3.4xlarge for data server
m5.2xlarge for master server
and t2.medium for zookeeper (single node)
In addition, I have S3 as deep storage and PostgreSQL as the metadata storage on the master server.

The problem is that nothing works, even though I did everything that is described here
Druid Version: 0.16.0
All common runtime properties and runtime properties of services are attached to this message.
The jvm and main configs are left at their defaults.
Logs for all services are in the attachments.

After deployment, I open the coordinator console on the master server (port 8081) and see the following (coordinator-console.png in the attachments), and the router console on the query server (port 8888) looks like this (router-console.png), with a lot of errors.
It is trying to reach the coordinator on localhost, but that service runs on the master server. How can I configure this? It seems that none of the services can see each other. What could be the problem?

S3, PostgreSQL and ZooKeeper are fine. S3 has logs after a failed data load, PostgreSQL has 10 new tables, and ZooKeeper receives packets from the servers and has a druid node.

I will be glad of any help!
Have a nice day!

broker.log (95.9 KB)

broker runtime.properties (1.3 KB)

coordinator-overlord.log (122 KB)

coordinator-overlord runtime.properties (1.15 KB)

data common.runtime.properties (4.38 KB)

historical.log (91.7 KB)

historical runtime.properties (1.35 KB)

master common.runtime.properties (4.38 KB)

middleManager.log (84 KB)

middleManager runtime.properties (1.53 KB)

data common.runtime.properties (4.38 KB)

router.log (145 KB)

router runtime.properties (1.22 KB)

Hi, I cannot see at first glance what is wrong, but I am watching this closely, as this is exactly the kind of thing I would want to address as part of the Druid Doctor work (https://github.com/apache/incubator-druid/pull/8672).

I hope this gets resolved soon, meanwhile don’t hesitate to reach out for some “real-time” assistance on the #druid channel in the ASF Slack ( https://s.apache.org/slack-invite )

Hi Alexander,

It looks like the Druid nodes cannot reach each other. Could you try changing the property druid.host from localhost to the correct hostname or IP address of each server?
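
For illustration, here is a minimal sketch of what that might look like in common.runtime.properties on the master server. The host names are placeholders and not values from this thread; each server would advertise its own address, one that the other servers can actually resolve and reach.

```properties
# common.runtime.properties on the master server (sketch)
# Placeholder address: replace with this server's own hostname or IP.
druid.host=master.example.internal

# ZooKeeper connection string, identical on every server (placeholder host).
druid.zk.service.host=zk.example.internal:2181
```

The query and data servers would carry a similar file with druid.host changed to their own addresses.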

Thanks,

Surekha

Hi guys, thanks for your quick answers!
I really appreciate it!

The problem is solved! Indeed, I had to set druid.host to each server's own address in its common runtime properties. Thank you very much, Surekha!
But it’s a pity that this information is not in the deployment guide; it was not quite obvious to me.

Hi,

We have been playing with a single-node cluster so far and now want to move on to setting up a multi-node cluster. As Alexander mentioned, there is not enough documentation on cluster deployment. We have a few questions and would highly appreciate replies.

  1. If the Master node goes down, which node assumes the role of master to avoid downtime? Or is the cluster down until the Master node comes back up?

  2. How does the Master node see the data nodes? Does it use ZooKeeper for discovery, or does a list of IP addresses need to be hardcoded, as described earlier in this thread?

  3. If the host on which the Master process (or any other Druid process) is running fails and the process has to be brought up on a different host, do we have to change the druid.host property and restart the entire cluster? That sounds pretty manual and complex.

  4. Is it possible today, or is it part of the roadmap, to use ZooKeeper for service discovery, so that the cluster is resilient to failures and the service stays available?

Please note that we are deploying the cluster on-prem.

Thanks.

Hi Santosh,

Please take a look at the document below for clustered deployments:

https://druid.apache.org/docs/latest/tutorials/cluster.html

Druid is engineered to be highly scalable and fault tolerant, and it uses ZooKeeper for service discovery. The above document will help you get going. To answer your specific questions:

  1. You can run multiple master nodes. We recommend 3 master nodes for a fault-tolerant design.

  2. ZooKeeper is used for service discovery within Druid.

  3. The answer to your 1st question above covers this.

  4. Yes, ZooKeeper is already part of the design and is leveraged for service discovery.
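
As a rough sketch of how the answers above might translate into configuration, assume a three-node ZooKeeper ensemble and one of three master servers (all host names are placeholders). Every Druid process registers itself in ZooKeeper under its own druid.host, so no list of peer addresses needs to be hardcoded.

```properties
# common.runtime.properties on one of three master servers (sketch)
# Placeholder: this server's own address, as reachable from the other servers.
druid.host=master-1.example.internal

# Comma-separated ZooKeeper connection string (placeholder hosts),
# the same on every Druid server in the cluster.
druid.zk.service.host=zk-1.example.internal:2181,zk-2.example.internal:2181,zk-3.example.internal:2181
```

With several Coordinator and Overlord processes running, one is elected leader and the others stay on standby, so losing a single master server should not require reconfiguring the rest of the cluster.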

Regards,

Muthu Lalapet.

Great, thanks Muthu. We did set up a multi-node cluster and it looks good!

I have one more question related to data in deep storage. Is it possible to retrieve data from deep storage on demand (during query time)? We want to store a minimal amount of data (a few days' worth) on local disks and the rest of the data in cold deep storage such as S3 (for cost optimization). If a query requires data beyond what is on local disk, we want to fetch that data from deep storage. Is it possible to do this?

Please note that this would be on-prem. So adding more hosts with slower disks on demand is not easy or scalable. But we could possibly leverage network attached object storage such as S3 as deep storage. Appreciate your thoughts!

Hi Santosh,

"Is it possible to retrieve data from deep storage on demand (during query time)?" No, this is not currently possible. However, you can use data tiering, with a cold tier of historical nodes on slower disks hosting the older data. Dynamically retrieving data from deep storage such as S3 for a query is not possible.
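
A minimal sketch of the tiering setup mentioned above, assuming a hypothetical tier name "cold" for the historical nodes with slower disks. druid.server.tier and druid.server.priority are set in the historical runtime.properties; the tier name and priority value here are illustrative only.

```properties
# historical runtime.properties on a cold-tier server (sketch)
# "cold" is a made-up tier name; the default tier is _default_tier.
druid.server.tier=cold

# Lower priority than the default of 0, so hot-tier replicas are preferred
# when the same segment is available on both tiers.
druid.server.priority=-10
```

Which segments land on which tier is then decided by the Coordinator's load rules, for example a period-based load rule whose tieredReplicants map names the cold tier.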

Regards,

Muthu Lalapet.