Multi data center setup of Druid

Hello,
We are planning to use Druid along with Spark Streaming for our real-time processing requirements.

We have a multi data center setup where data is produced in every data center, and our requirement is to have an aggregated view of the data in Druid for analytics.

Which of the approaches below is recommended:

  1. The Druid cluster is set up in only one data center, and all other data centers perform remote writes (data ingestion from the remote data centers) into it.

  2. The Druid cluster spans data center boundaries, so that there is one Druid cluster consisting of nodes from all data centers, with an Overlord running in each data center; this ensures that data ingestion happens in the local data center.
    This option can suffer from a lot of rebalancing if nodes lose their connection to ZooKeeper.

  3. We move all the data into one data center using a data bus such as Kafka. This would be expensive in terms of data movement and is hence a less preferred option.
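For reference, option 3 is commonly implemented with Kafka's MirrorMaker tool, which consumes from a Kafka cluster in one data center and re-produces into a cluster in another. A minimal sketch (the topic name, hostnames, and group id are hypothetical):

```shell
# Hypothetical consumer.properties: points at the Kafka cluster
# in the remote (source) data center.
#   zookeeper.connect=remote-dc-zk:2181
#   group.id=dc-mirror
#
# Hypothetical producer.properties: points at the Kafka cluster
# in the central (target) data center, where Druid ingests.
#   bootstrap.servers=central-dc-kafka:9092

bin/kafka-mirror-maker.sh \
  --consumer.config consumer.properties \
  --producer.config producer.properties \
  --whitelist 'events'
```

The cross-data-center bandwidth cost is the same as any other data-movement approach, but MirrorMaker keeps the WAN hop at the Kafka layer, where retries and buffering are built in, rather than at the Druid ingestion layer.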

Other options are also welcome.

Thanks

Rohit

Hi Rohit, please see inline.

Hello,
We are planning to use Druid along with Spark Streaming for our real-time processing requirements.

We have a multi data center setup where data is produced in every data center, and our requirement is to have an aggregated view of the data in Druid for analytics.

Which of the approaches below is recommended:

  1. The Druid cluster is set up in only one data center, and all other data centers perform remote writes (data ingestion from the remote data centers) into it.

I would say this is the best option, and it is similar to what I know some other organizations do. When you say remote writes, can you elaborate on what that means?

  2. The Druid cluster spans data center boundaries, so that there is one Druid cluster consisting of nodes from all data centers, with an Overlord running in each data center; this ensures that data ingestion happens in the local data center.
    This option can suffer from a lot of rebalancing if nodes lose their connection to ZooKeeper.

Agreed. I think timeouts with ZooKeeper will present interesting challenges in this case.
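One knob that matters here is the ZooKeeper session timeout each Druid node uses. Over a WAN, latency spikes make session expirations (and the resulting segment rebalancing) more likely; raising the timeout trades slower failure detection for fewer spurious rebalances. A sketch of the relevant properties (hostnames and the value are illustrative):

```
# common.runtime.properties on each Druid node
druid.zk.service.host=zk1:2181,zk2:2181,zk3:2181
druid.zk.service.sessionTimeoutMs=30000
```

This tuning softens the symptom but does not remove it; a single ZooKeeper ensemble shared across data centers remains the weak point of option 2.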

@Fangjin Yang
In option 1), we would deploy a Storm cluster in each data center that pushes data to an Overlord deployed in another data center. This means all write calls would happen over the internet rather than over a LAN, so I am skeptical about its performance, throughput, etc.

Can this setup also lead to loss of events in some cases?

Thanks

Rohit

I don’t have much experience with multi data center setups that include Storm and Druid, but in general, real-time processing with open source technologies does not guarantee exactly correct results, which is why many organizations run a lambda architecture in house: the streaming path gives fast but possibly lossy results, and a batch path later reprocesses the raw data to correct them.
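To make the event-loss concern concrete: a remote writer that fires and forgets over the WAN will silently drop events during an outage. A minimal sketch of a local buffer with retries, where `send` is a hypothetical stand-in for the push to the remote Overlord (not a real Druid or Tranquility API):

```python
import collections


class BufferedRemoteWriter:
    """Buffers events locally and retries pushes, so a transient
    WAN failure delays delivery instead of silently losing events."""

    def __init__(self, send, max_retries=3):
        self.send = send          # hypothetical push to the remote Overlord
        self.max_retries = max_retries
        self.buffer = collections.deque()
        self.failed = []          # events given up on -> replay via batch path

    def write(self, event):
        self.buffer.append(event)
        self.flush()

    def flush(self):
        while self.buffer:
            event = self.buffer[0]
            for _attempt in range(self.max_retries):
                try:
                    self.send(event)
                    break
                except ConnectionError:
                    pass  # real code would back off (sleep) between attempts
            else:
                # Retries exhausted: hand the event to the batch (lambda)
                # path for later replay instead of dropping it.
                self.failed.append(event)
            self.buffer.popleft()
```

The point of the sketch is the `failed` list: whatever the streaming path cannot deliver must be captured somewhere durable, which is exactly the role the batch layer plays in a lambda architecture.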