Some questions about Druid

Hello, everyone. I'm a newbie to Druid. After reading the Druid paper (SIGMOD '14), I have a few questions.
The first is: why not keep the number of external dependencies as small as possible? For example, Druid already uses ZooKeeper to store cluster info, so why not move the data that lives in MySQL into ZooKeeper as well? When we run MySQL in production we have to guarantee its reliability, and the same goes for ZooKeeper. I think the current setup has too many operational difficulties to overcome.

The second is about the Historical node. With a shared-nothing architecture for historical nodes, we need to keep extra replicas of each segment, right? A node first downloads the segment from deep storage and then serves queries from its local copy. That seems like a lot of data redundancy; maybe it could be reduced.

Hi Arvin,

I think we could remove the dependency on mysql and zookeeper by storing metadata and coordination info in Druid itself (with something like Raft: https://raft.github.io/).

They’re currently separate external dependencies for largely historical reasons. First we used only zookeeper, then we split out the metadata store since zookeeper wasn’t scaling well enough to store all the metadata we wanted to store. But we didn’t completely replace zookeeper with a metadata store, since zookeeper is still useful for leader election and service discovery.

The deep storage design has its pros and cons. The biggest pro is that it makes scaling super easy: you can add or remove historical nodes anytime you want, and the system will automatically transfer data over from deep storage to rebalance. The con of course is that you need a deep storage system storing an additional copy of your data.
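To make the rebalancing idea concrete, here is a minimal sketch in Python. It is a toy model, not Druid's actual coordinator logic: the function names (`assign_segments`, `plan_moves`), the node and segment names, and the least-loaded placement rule are all assumptions made up for illustration. The point it shows is that a newly added historical node only needs to pull its segments from deep storage, so nodes never have to copy data from each other.

```python
from collections import defaultdict


def assign_segments(segments, nodes, replicas=2):
    """Toy placement: put each segment on `replicas` distinct nodes,
    always choosing the currently least-loaded nodes.
    (Hypothetical rule; Druid's real balancer works differently.)"""
    if not nodes:
        raise ValueError("need at least one historical node")
    replicas = min(replicas, len(nodes))
    load = {node: 0 for node in nodes}
    assignment = defaultdict(set)
    for segment in sorted(segments):
        # Ties are broken by node name so the result is deterministic.
        targets = sorted(nodes, key=lambda n: (load[n], n))[:replicas]
        for node in targets:
            assignment[node].add(segment)
            load[node] += 1
    return {node: assignment[node] for node in nodes}


def plan_moves(old, new):
    """For each node, list the segments it must download from deep
    storage and the local copies it can drop."""
    plan = {}
    for node, wanted in new.items():
        have = old.get(node, set())
        plan[node] = {"load": wanted - have, "drop": have - wanted}
    return plan


# Deep storage always holds the full set of segments.
deep_storage = {"seg-%d" % i for i in range(6)}

before = assign_segments(deep_storage, ["hist-a", "hist-b"])
after = assign_segments(deep_storage, ["hist-a", "hist-b", "hist-c"])
moves = plan_moves(before, after)

for node in sorted(moves):
    print(node, "load:", sorted(moves[node]["load"]),
          "drop:", sorted(moves[node]["drop"]))
```

In this sketch the existing nodes only drop segments when a third node joins, and the new node loads everything it needs from deep storage. The extra copy in deep storage is what makes that possible.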

Fwiw, Druid was originally designed to run in a cloud environment, where metadata store and deep storage are easy to deploy (you can use AWS RDS and S3).

Thank you very much~

On Tuesday, July 3, 2018 at 3:03:09 PM UTC+8, Gian Merlino wrote:

Regarding your second question, I had the same confusion at first.
But I think Druid's focus is on processing and analyzing data. It isn't much concerned with how the data is durably stored, so it simply delegates that job to another system.