Druid vs. InfluxDB

So I have been doing a bit of research into time-series databases, and Druid and InfluxDB have caught my attention.
I wanted to post some of my thoughts and was wondering whether others think this is an accurate analysis of the two technologies.

For my use cases, I am currently leaning towards Druid, which seems to emphasize speed and scalability over ease of use.

What do you all think?

Druid vs. InfluxDB

• Why Druid is better

  o Major emphasis: speed and scalability.
  o InfluxDB does not scale well at this point: the maximum (and required) cluster size is 3 nodes, whereas Druid is horizontally scalable without bound.
  o InfluxDB's performance degrades significantly when grouping by tags (dimensions in Druid) with cardinality > 100,000.
  o InfluxDB uses BoltDB as its internal storage engine and therefore does not provide the flexibility that Druid does in selecting a storage backend (S3, HDFS, or local storage).
  o Ability to write custom JavaScript aggregation functions (a sketch of one follows this section).
  o Offers real-time, reliable ingestion through public APIs and Kafka. InfluxDB does not hook up to Kafka out of the box (there is a project on GitHub to do this, but it has only 2 contributors).
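
To make the custom aggregation point concrete, here is a rough sketch of a Druid JavaScript aggregator that sums squared values; the output name "sum_sq" and input field "value" are made up for illustration. It would go in the "aggregations" list of a query or ingestion spec:

    {
      "type": "javascript",
      "name": "sum_sq",
      "fieldNames": ["value"],
      "fnAggregate": "function(current, v) { return current + v * v; }",
      "fnCombine": "function(partialA, partialB) { return partialA + partialB; }",
      "fnReset": "function() { return 0; }"
    }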

• Why InfluxDB is better

  o Major emphasis: usability and simplicity.
  o Offers a SQL-like query language that is very intuitive; much easier than having to frame queries in JSON (see the example after this list).
  o Several user interfaces already exist for data exploration and visualization (e.g. Grafana).
  o Comes with more built-in aggregation functions (PERCENTILE, STDDEV, etc.).
  o Is completely schemaless, in that you can add columns on the fly; you must keep column data types consistent in order to get expected query results.
  o Has no external dependencies (Druid relies on MySQL for metadata and Apache ZooKeeper for coordination).
  o Much simpler to expire data: adding a retention policy (1 day, 1 month, etc.) can be done in one line (also shown below), whereas in Druid you must write a rule to the Coordinator config, which is more difficult.
  o Allows joins across series (the equivalent of SQL tables).
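
To make the usability contrast concrete, here is a rough sketch assuming a hypothetical "requests" series with a "latency" field (all names are made up for illustration). The first two statements are InfluxQL: an hourly aggregation using the built-in PERCENTILE function, and a one-line retention policy. The last snippet is roughly what the equivalent hourly-sum query looks like as Druid JSON:

    SELECT SUM(latency), PERCENTILE(latency, 95) FROM requests
      WHERE time > now() - 1d GROUP BY time(1h)

    CREATE RETENTION POLICY "one_month" ON "mydb" DURATION 30d REPLICATION 1 DEFAULT

    {
      "queryType": "timeseries",
      "dataSource": "requests",
      "granularity": "hour",
      "intervals": ["2015-09-20/2015-09-21"],
      "aggregations": [
        {"type": "doubleSum", "name": "total_latency", "fieldName": "latency"}
      ]
    }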

Hey Austin, thanks for the writeup! You’re right that Druid focuses on speed and scalability. I’d just add a couple of things:

  • Druid does support adding columns on the fly. You can change the ingestion spec (dimensions & metrics) any time you want, and the new spec will take effect for newly ingested data. Queries will work against both older schemas and newer schemas simultaneously, as long as the column types don’t conflict. You can also run Druid ingestion in a “schemaless dimensions” mode, where any field that you don’t explicitly list as a metric automatically becomes a dimension.
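
    As a rough sketch, the dimensionsSpec in your ingestion spec for that mode looks something like this (field names are placeholders): leaving "dimensions" empty tells Druid to treat every field that isn't the timestamp, a metric input, or explicitly excluded as a dimension.

        "dimensionsSpec": {
          "dimensions": [],
          "dimensionExclusions": ["latency"]
        }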

  • There are a couple of community-contributed GUIs: a Grafana plugin (https://github.com/Quantiply/grafana-plugins) and a GUI built from the ground up for Druid (https://github.com/mistercrunch/panoramix)

  • There are also community-contributed SQL-like query languages for Druid, one for Java and one for JavaScript (the latter can also be used from the command line): http://druid.io/docs/latest/development/libraries.html

  • Druid actually can compute percentiles, using its approximate histogram & quantile aggregator. It can require a bit of tuning to get the right accuracy vs storage tradeoff for you, which is why it’s labeled experimental.
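
    As a rough sketch (parameter names are from memory, so double-check them against the docs; this assumes the druid-histogram extension is loaded and a numeric "latency" column), the aggregator plus the quantile post-aggregator look something like:

        "aggregations": [
          {"type": "approxHistogram", "name": "latency_hist", "fieldName": "latency",
           "resolution": 50, "numBuckets": 7}
        ],
        "postAggregations": [
          {"type": "quantile", "name": "latency_p95", "fieldName": "latency_hist", "probability": 0.95}
        ]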

  • The coordinator config rules can be added programmatically in a single call (there’s an HTTP API). I’m not sure if that makes life easier or harder for you, though…
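
    For example, posting a JSON array like the following to the Coordinator's rules endpoint (/druid/coordinator/v1/rules/<datasource>; the datasource and tier settings here are placeholders) sets up "keep one month, drop the rest" in a single call:

        [
          {"type": "loadByPeriod", "period": "P1M", "tieredReplicants": {"_default_tier": 2}},
          {"type": "dropForever"}
        ]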

I would also note that Druid does not require MySQL specifically; it works with PostgreSQL as well. But it is true to say that it requires an external metadata store.
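
For what it's worth, switching the metadata store is just a couple of runtime properties, roughly like the following (the values are placeholders, and this assumes a reasonably recent Druid with the postgresql-metadata-storage extension loaded):

    druid.metadata.storage.type=postgresql
    druid.metadata.storage.connector.connectURI=jdbc:postgresql://<host>:5432/druid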