First of all, I wanted to say that my colleagues and I are very interested in Druid. The features, especially the aggregations at ingestion time, are really cool.
We have some data types that fit the time series definition perfectly, and the tasks that can be run at ingestion time are convenient for us. We want to have it running permanently (like a normal DB) so it can ingest streaming data through Tranquility Server (we have already achieved this with a simple VM instance and a GCP Function).
We were wondering what options we have for deploying a clustered Druid on GCP that is managed automatically and scales with demand, so not a plain VM solution. We found that we can use:
a) a Docker image per server type (master, data, query) on GKE, so Google can autoscale it and so on. However, Druid has not released an official image yet (they say it will come with the 0.16 release);
b) a GCP Dataproc cluster with Druid and Zookeeper components installed.
So we chose option b, but now I have a question and a problem. First, can a Dataproc cluster be left running indefinitely? We want it to act like a normal DB for some data types. Second, if so, how can we start Druid on Dataproc? We took a look at the files in Druid's directory and they are all already configured, but we can't start the supervise script.
So, has anyone already been using Dataproc for Druid? If so, do you use it only for batch jobs, or for continuous ingestion as well?
Thanks in advance!
Best Regards, Julian