Small scale deployment?

The use cases for Druid I’ve seen so far have all been big data scenarios with petabytes of data etc.

However, I’m wondering if Druid can also be a good, cost-effective choice for small scale deployments. The benefit would be building a solid architecture once and then casually scaling it up as time goes on, instead of starting with an RDBMS or similar and then having to re-architect the whole thing when it becomes difficult to scale.

As a starting point we can imagine a web startup needing to track only about 5000 events per day. They’re currently pushing events to Kafka, so we can assume that will be the integration point for Druid ingestion.
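For context, Druid’s Kafka indexing service is driven by a supervisor spec submitted to the Overlord. A minimal sketch might look like the following — the datasource name, topic, broker address, and dimension names here are all placeholders, and the exact field layout varies by Druid version, so treat this as illustrative rather than copy-paste ready:

```json
{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "events",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["user_id", "event_type"] },
      "granularitySpec": { "segmentGranularity": "day", "queryGranularity": "none" }
    },
    "ioConfig": {
      "topic": "events",
      "consumerProperties": { "bootstrap.servers": "kafka:9092" },
      "useEarliestOffset": true
    }
  }
}
```

At 5000 events/day, daily segment granularity keeps the segment count low, which matters on a single small node.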

What would be some minimal EC2 setup needed to support such a scenario?

Given that the Druid quickstart tutorial recommends a minimum of 2 vCPUs and 8 GB of RAM, could one get away with running the whole Druid ecosystem on a single m4.large node, for example? Or would it be better to cluster together some even smaller instances?
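For reference, recent Druid distributions bundle single-server launch scripts sized for small machines (the exact script names vary by release, so check the `bin/` directory of your distribution before relying on these):

```shell
# Single-server profiles shipped with recent Druid releases (names may
# differ in your version). The "nano" profile targets roughly 1 vCPU/4 GB
# and the "micro" profile roughly 4 vCPU/16 GB, so an m4.large
# (2 vCPU/8 GB) would sit between the two.
./bin/start-nano-quickstart
```

These start all Druid services in one JVM-per-service layout on a single host, which is the usual shape for a minimal deployment.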

A related question when aiming to scale as small as possible: how is Druid’s durability without redundancy? Will Kafka + a single Druid node + S3 deep storage be durable, and will it catch up on data not yet persisted to S3 in case of a crash? (Sorry if I’m missing something obvious here; I have yet to dig deep into the Druid documentation.)

Thanks,

David

Hey David,

There are a number of folks using Druid at smaller scale (1–5 nodes). It can be a bit unwieldy, since many of its APIs are designed for the “at scale” use case and can be awkward at small scale, but it definitely works. An m4.large would be cozy but should work.

Kafka + single Druid node + RDS metadata store + S3 deep store would be durable. In the event of a crash, anything not yet persisted to S3 would be re-read from Kafka. The RDS metadata store is important, though, since you want to preserve the metadata records associated with the S3 deep store data. If you’re really looking to cut costs, you could get away with a non-RDS metadata store, but you’d want to at least put the metadata DB on a persistent volume and back it up periodically.
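If you do go the self-hosted route, a periodic dump of the metadata DB is cheap insurance. A hypothetical sketch assuming a self-managed PostgreSQL metadata store (the database name `druid` and the bucket path are assumptions — match them to your `druid.metadata.storage` config):

```shell
# Nightly backup of a self-hosted PostgreSQL metadata store.
# "druid" database name and s3://my-bucket path are placeholders.
pg_dump -Fc druid > /backup/druid-metadata-$(date +%F).dump
aws s3 cp /backup/druid-metadata-$(date +%F).dump s3://my-bucket/metadata-backups/
```

Losing the metadata DB doesn’t destroy the segments in S3, but without the segment records Druid won’t know they exist, so the backup is what makes recovery straightforward.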