Operations tips and advice

As you may have seen on the druid dev group, some of us at Optimizely are experimenting with Druid and may soon have a service that relies on it.

I’m interested in the devops/ops side of druid, and have a few questions:

  • What do you guys use to spin up new druid instances? E.g.: chef, ansible, salt. So far we’ve forked off an outdated community chef repo and updated it to support druid v0.7, but I’d be curious to know if there are any other open source cookbooks/playbooks/automated config scripts out there.
  • Are there any tools out there that will benchmark/stress test a druid cluster?
  • Are there any tools out there that will perform an integration test on a druid cluster?
  • What are the most common types of problems you encounter on your production cluster? Are these caused by hardware failure, druid updates, insufficient capacity, malformed queries, or something else?
  • Druid can emit metrics (push them out), but is there any way to poll druid for those metrics?

We’re just down the road from Metamarkets, so if someone familiar with ops wants to grab a coffee sometime, I’d be happy to talk and learn.

Conrad

Hi Conrad, please see inline.

  • What do you guys use to spin up new druid instances? E.g.: chef, ansible, salt. So far we’ve forked off an outdated community chef repo and updated it to support druid v0.7, but I’d be curious to know if there are any other open source cookbooks/playbooks/automated config scripts out there.

We run in AWS and run jenkins jobs to spin up new nodes. We combine these jobs with galaxy (https://github.com/ning/galaxy) for management and deployment.

  • Are there any tools out there that will benchmark/stress test a druid cluster?

We’ve open sourced our query benchmarks on an older version of Druid here: http://druid.io/blog/2014/03/17/benchmarking-druid.html

We’d love contributions for more benchmarks.

  • Are there any tools out there that will perform an integration test on a druid cluster?

We’ve open sourced our integration tests here: https://github.com/druid-io/druid/tree/master/integration-tests

It uses Docker to spin up a local cluster. This should be extendable to multiple nodes.

  • What are the most common types of problems you encounter on your production cluster? Are these caused by hardware failure, druid updates, insufficient capacity, malformed queries, or something else?

Multi-tenancy is always an interesting problem. Druid’s parallelization model uses 1 core to scan 1 segment. We can see periods of slowness where all the cores of our cluster are busy scanning segments, which causes queries to back up and slows things down. This happens during peak hours, when all users are using the same application at the same time.
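To make the one-core-per-segment model concrete, here is a rough back-of-the-envelope sketch in Python. The function names and numbers are made up for illustration, and it ignores caching, query priorities and merge overhead; it only shows how concurrent queries multiply the number of scan "waves" once every core is busy.

```python
import math


def scan_waves(segments_per_query: int, concurrent_queries: int, total_cores: int) -> int:
    """Sequential waves needed when each core scans one segment at a time."""
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    pending_scans = segments_per_query * concurrent_queries
    return math.ceil(pending_scans / total_cores)


def estimated_latency_ms(segments_per_query: int, concurrent_queries: int,
                         total_cores: int, scan_ms_per_segment: float) -> float:
    """Crude latency estimate: waves multiplied by an assumed per-segment scan time."""
    waves = scan_waves(segments_per_query, concurrent_queries, total_cores)
    return waves * scan_ms_per_segment


if __name__ == "__main__":
    # Hypothetical cluster: 160 cores, queries touching 200 segments each,
    # and an assumed 50 ms to scan one segment.
    for queries in (1, 10):
        print(queries, "concurrent:", estimated_latency_ms(200, queries, 160, 50), "ms")
```

With these assumed numbers a single query needs 2 waves, while 10 concurrent queries need 13, which is the kind of backlog described above.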

We’ve experienced problems where we didn’t properly monitor our historical nodes and ran out of capacity on them. This causes handoff to stall, and eventually the realtime ingestion will begin to self-throttle in an attempt to avoid memory overflow problems.
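A capacity check along these lines could be sketched as follows. This is only an illustration: it assumes you can already get used and maximum bytes per historical node from somewhere, and the field names `host`, `currSize` and `maxSize`, as well as the 85% threshold, are assumptions to be replaced with whatever your coordinator or inventory actually reports.

```python
def nodes_over_threshold(servers, threshold=0.85):
    """Return (host, utilization) pairs for nodes at or above the threshold.

    `servers` is a list of dicts with assumed keys: host, currSize, maxSize.
    """
    flagged = []
    for server in servers:
        max_size = server["maxSize"]
        if max_size <= 0:
            # Skip nodes that report no usable capacity.
            continue
        utilization = server["currSize"] / max_size
        if utilization >= threshold:
            flagged.append((server["host"], round(utilization, 3)))
    # Fullest nodes first, so the most urgent ones lead the alert.
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical inventory, sizes in bytes.
    inventory = [
        {"host": "historical-1:8083", "currSize": 900, "maxSize": 1000},
        {"host": "historical-2:8083", "currSize": 100, "maxSize": 1000},
    ]
    print(nodes_over_threshold(inventory))
```

Alerting before the nodes are full leaves time to add capacity before handoff stalls.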

  • Druid can emit metrics (push them out), but is there any way to poll druid for those metrics?

Currently, no. We emit all of our metrics to Kafka, so one thing we’ve done is write wrappers that push metrics to Kafka for systems that have a polling model.
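For anyone wondering what such a wrapper might look like, here is a minimal sketch, not our actual code. `poll` is a hypothetical callable that returns a dict of metric names to values from the polled system, and `send` stands in for whatever your Kafka client uses to publish bytes to a topic. The event shape is also an assumption and is not Druid’s emitter format.

```python
import json
import time


def poll_and_push(poll, send, service, now=time.time):
    """Poll a source once and push each reading as a JSON-encoded event.

    poll:    hypothetical callable returning {metric_name: value}
    send:    hypothetical callable accepting bytes (e.g. a bound Kafka producer call)
    service: label identifying the polled system
    Returns the number of events sent.
    """
    timestamp = now()
    sent = 0
    for metric, value in poll().items():
        event = {
            "timestamp": timestamp,
            "service": service,
            "metric": metric,
            "value": value,
        }
        send(json.dumps(event, sort_keys=True).encode("utf-8"))
        sent += 1
    return sent


if __name__ == "__main__":
    # Stand-ins for a real polled system and a real producer.
    fake_poll = lambda: {"jvm/heap/used": 123, "requests/count": 7}
    poll_and_push(fake_poll, print, service="example-service")
```

In practice this would run in a loop with a sleep between polls, with `send` bound to a real producer and topic.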

We’re just down the road from Metamarkets, so if someone familiar with ops wants to grab a coffee sometime, I’d be happy to talk and learn.

Happy to chat sometime :)