Some beginner's questions on using Druid to store and query lots of timeseries data

Hello all,

I’d like to use Druid as a time series DB and have some beginner’s questions; sorry if they have been asked before.

Use case:

I have several years’ worth of sensor data (sampled at 1 Hz, a couple of dozen sensors for now, possibly scaling up to a couple of hundred later) that I’d like to store and make available for interactive analysis, i.e. a dashboard that lets users query a couple of days’ worth of data from arbitrary timeframes within those several years, at second granularity.

Here are the questions:

- Would you recommend Druid for this use case? If not, what would you recommend?

- Should I store sensor values (mostly floats) as metrics or dimensions? I am mostly interested in displaying the raw values.

- On visualization: I have tried Superset but ran into issues with gaps in the sensor data, which seem to prevent aggregations (e.g. max value in 10-second buckets) from working. I read that resampling can fix this, but isn’t that resource-intensive? Is there a way to deal with this at ingestion (e.g. treat nulls as 0 for aggregation)?

- I have previously used Graphite/Grafana, which adapts aggregations to the query (i.e. zooming out of a graph requests aggregates over larger time frames than zooming in); does Druid/Superset have similar support, or do you have a recommendation on how to set up this kind of functionality?

- Should I store derived metrics, such as aggregations at different granularities, at ingestion for this purpose (i.e. zooming in and out without overloading the front end)? Do you have an example?

Thanks for reading this far and sorry if the questions are too basic. I’d also appreciate links to helpful resources.

Best regards,

Jan

Hi Jan,

My experience is that Druid is especially competitive for timeseries data when you need to do slicing/dicing across timeseries, or when you have high-cardinality dimensions (what some timeseries DBs call tags). Those kinds of operations play to Druid’s strengths.

I’d store the sensor values as dimensions, since that will let you retain the raw value. If you store them as metrics then they may be rolled up.
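For concreteness, both of those choices live in the ingestion spec. A minimal sketch of the relevant part of a dataSchema is below, assuming a datasource called "sensor_readings" with columns "sensor_id" and "value" (those names are just placeholders for your own schema, and the exact layout depends on your Druid version and ingestion method): the sensor value is declared as a double-typed dimension, and rollup is disabled so the raw rows are kept as-is.

"dataSchema": {
  "dataSource": "sensor_readings",
  "dimensionsSpec": {
    "dimensions": [
      "sensor_id",
      { "type": "double", "name": "value" }
    ]
  },
  "granularitySpec": {
    "segmentGranularity": "day",
    "queryGranularity": "none",
    "rollup": false
  }
}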

With regard to visualization: other than the apps you mentioned, ours at Imply (https://imply.io/product) is the one I’m most familiar with. It can treat nulls as zeroes, and we are looking at adding more options (like interpolation or last value), but we haven’t added those yet.
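If you’d rather handle it at ingestion time, one option is a transformSpec that replaces nulls with 0 as rows are ingested. A sketch, again assuming the sensor reading lives in a column called "value" (note that this only affects rows that arrive with a null value; it does not create rows for timestamps where the sensor reported nothing at all):

"transformSpec": {
  "transforms": [
    { "type": "expression", "name": "value", "expression": "nvl(value, 0)" }
  ]
}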

With regard to derived metrics and aggregations: you should just let Druid handle those; it is what Druid is designed for! Check out the “timeseries” query, which lets you change granularity and aggregations on the fly: http://druid.io/docs/latest/querying/timeseriesquery.html
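For example, the max-per-10-second-bucket case from your Superset question would look roughly like this as a native timeseries query (the datasource, dimension, and column names are placeholders, and the granularity period and interval are just examples):

{
  "queryType": "timeseries",
  "dataSource": "sensor_readings",
  "granularity": { "type": "period", "period": "PT10S" },
  "intervals": [ "2018-06-01T00:00:00Z/2018-06-03T00:00:00Z" ],
  "filter": { "type": "selector", "dimension": "sensor_id", "value": "sensor_1" },
  "aggregations": [
    { "type": "doubleMax", "fieldName": "value", "name": "max_value" }
  ]
}

Zooming out is then just a matter of issuing the same query with a larger granularity period (say PT10M instead of PT10S).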

Hi Gian,

thanks for your response! I have read that Druid is good for slicing/dicing. In my use case, where I mostly want to visualize historical data, users who zoom out would look at aggregates over bins to reduce the amount of data displayed, so in that sense Druid seems like a good choice. When the user zooms in to the most granular level, the query would be a simple “select everything between two timestamps”.
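From the docs, that zoomed-in case looks like it maps to Druid’s “scan” query, which just returns the raw rows between two timestamps. A rough sketch with placeholder datasource and column names for my setup:

{
  "queryType": "scan",
  "dataSource": "sensor_readings",
  "intervals": [ "2018-06-01T12:00:00Z/2018-06-01T12:05:00Z" ],
  "columns": [ "__time", "sensor_id", "value" ],
  "resultFormat": "compactedList"
}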

I will take a look at Imply to see if it can do what I had in mind.

Best regards,

Jan