We are planning to use Druid along with Spark Streaming for our real-time processing requirements.
We have a multi-data-center setup where data is produced in every data center, and our requirement is to have an aggregated view of the data in Druid for analytics.
Which of the approaches below is recommended?

1. Set up the Druid cluster in a single data center and perform remote writes (ingestion from remote data centers) into it from all the others.
2. Spread one Druid cluster across data center boundaries, with nodes from all data centers and an Overlord running in each data center, so that ingestion always happens in the local data center. This can suffer from a lot of rebalancing if nodes lose their connection to ZooKeeper.
3. Move all the data into one data center over a data bus such as Kafka. This would be expensive in terms of data movement, so it is a less preferred option.
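For context on the Kafka option: each data center's producers could write to a local topic that is replicated (e.g. via MirrorMaker) into the central data center, where Druid's Kafka indexing service consumes it. A minimal supervisor spec sketch follows; the topic name, datasource name, dimensions, and bootstrap servers are all hypothetical placeholders:

```json
{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "events-aggregated",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["datacenter", "event_type"] },
      "granularitySpec": {
        "segmentGranularity": "hour",
        "queryGranularity": "minute"
      }
    },
    "ioConfig": {
      "topic": "events-global",
      "consumerProperties": { "bootstrap.servers": "kafka-central:9092" },
      "taskCount": 2,
      "useEarliestOffset": false
    },
    "tuningConfig": { "type": "kafka" }
  }
}
```

This spec would be submitted to the Overlord's supervisor endpoint (POST /druid/indexer/v1/supervisor) in the central data center.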
Other options are also welcome.