So I have been doing a bit of research into time-series databases, and Druid and InfluxDB have caught my attention.
I wanted to post some of my thoughts and was wondering if others think this is an accurate analysis of the technologies.
For my use cases, I am currently leaning towards Druid, which seems to emphasize speed even more than ease of use.
What do you all think?
Druid is better
Emphasis – Speed and Scalability
InfluxDB does not scale well at this point – the maximum and required cluster size is 3
nodes, whereas Druid is horizontally scalable without bound.
InfluxDB performance degrades significantly when grouping by tags (dimensions in Druid)
with cardinality > 100,000
InfluxDB uses BoltDB as its internal storage engine and therefore does not provide the
flexibility that Druid does in selecting a backend (S3, HDFS, or local storage).
Druid offers real-time, reliable ingestion through public APIs and Kafka. InfluxDB does not
just hook up to Kafka (there is a project on GitHub to do this, but it has only
InfluxDB is better
Emphasis – Usability and Simplicity
SQL-like query language that is very intuitive - much easier than having to frame queries in JSON
user interfaces already exist to allow for data exploration and visualization
comes with more built-in aggregation functions – PERCENTILE, STDDEV, etc.
completely schema-less in that you can add columns on the fly, although you must keep
column data types consistent in order to get expected query results.
no external dependencies (Druid relies on MySQL and Apache Zookeeper for
simpler to expire data – adding a retention policy (1 day, 1 month, etc.) can
be done in one line, whereas in Druid you must write a rule to the Coordinator
config, which is more difficult.
for joins across series (SQL table equivalent)
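
To make the query-language and data-expiry points above more concrete, here is a rough sketch in Python of what I mean. All names are made up for illustration (a measurement/datasource called cpu, a tag/dimension called host, a value column called value, a database called mydb). The InfluxDB side is just a query string, while the Druid side is a JSON document that, as far as I understand, gets POSTed to the Broker (for queries) or the Coordinator's rules endpoint (for retention). The Druid mean is only approximate here, since the count aggregator counts stored rows, which may already be rolled up.

```python
import json

# Hypothetical names throughout: "cpu", "host", "value", "mydb".


def influx_query() -> str:
    """InfluxQL-style query: mean of `value` per host in 10-minute buckets."""
    return (
        "SELECT MEAN(value) FROM cpu "
        "WHERE time > now() - 1h "
        "GROUP BY time(10m), host"
    )


def druid_query(interval: str) -> dict:
    """Roughly equivalent Druid native groupBy query, built as a JSON-able dict.

    `interval` is an ISO-8601 interval such as "2015-01-01/2015-01-02".
    """
    return {
        "queryType": "groupBy",
        "dataSource": "cpu",
        "granularity": {"type": "period", "period": "PT10M"},
        "dimensions": ["host"],
        "aggregations": [
            {"type": "doubleSum", "name": "value_sum", "fieldName": "value"},
            {"type": "count", "name": "rows"},
        ],
        # No built-in mean here, so divide the sum by the row count.
        "postAggregations": [
            {
                "type": "arithmetic",
                "name": "value_mean",
                "fn": "/",
                "fields": [
                    {"type": "fieldAccess", "fieldName": "value_sum"},
                    {"type": "fieldAccess", "fieldName": "rows"},
                ],
            }
        ],
        "intervals": [interval],
    }


def influx_retention() -> str:
    """One-line retention policy: keep roughly one month of data."""
    return (
        'CREATE RETENTION POLICY "one_month" ON "mydb" '
        "DURATION 30d REPLICATION 1 DEFAULT"
    )


def druid_retention_rules() -> list:
    """Coordinator rules, evaluated in order: load the last month, drop the rest."""
    return [
        {
            "type": "loadByPeriod",
            "period": "P1M",
            "tieredReplicants": {"_default_tier": 2},
        },
        {"type": "dropForever"},
    ]


if __name__ == "__main__":
    print(influx_query())
    print(json.dumps(druid_query("2015-01-01/2015-01-02"), indent=2))
    print(influx_retention())
    print(json.dumps(druid_retention_rules(), indent=2))
```

The difference in verbosity is the point I am getting at: one line each for InfluxDB versus a nested JSON structure for Druid, though the JSON is arguably easier to generate programmatically.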