A bunch of questions!

Hi All,

The Druid guys asked me to share my questions with the community, so here they are:

  1. Size of the biggest cluster currently in production? #servers, #events, …
  2. Bulk insert API (e.g., for an initial load)?
  3. Support for idempotent inserts? I mean, if I insert the same data point several times (i.e., same timestamp, same dimensions), does the DB end up with one entry or several?
  4. Support for upserts? I mean, is it possible to update an existing data point?
  5. Interpolation support? I mean, handling the aggregation of multiple time series that are not aligned.
  6. Availability of an AWS CloudFormation template or a set of Docker images to deploy a test environment?
  7. Support for stddev, derivative/rate, difference, … aggregate functions?
  8. Support for custom aggregate functions? I saw it is possible to write a JavaScript function, but can Druid also be extended through a Java plugin system?
  9. Support for custom batch processing? I mean, executing a map/reduce-style job (or similar) over a large subset of the data managed by Druid, independently of the deep storage used.
  10. An efficient Spark connector (if HDFS is used as deep storage) to compute complex analytics? I mean, directly using the segments stored in HDFS to batch-process the data with something like Spark or Impala.
  11. Full-text search on text columns? If not, is there a plugin for Elasticsearch?
  12. Support for metric discovery? I mean, access to a dictionary of the metric/event types existing in the system (with regex support).
  13. Query UI?
  14. Admin/monitoring UI?
  15. Access control, and the granularity of that access control? I mean, controlling access to the database, or to a subset of it, for a set of users. By granularity, I mean access control on the whole DB or on a specific time series.
  16. Commercial support and pricing?
  17. Number of core developers?

Best

Laurent Querel

Hi, please see inline.

  1. Size of the biggest cluster currently in production? #servers, #events, …

50-100 PB of raw data, around 20 trillion raw events. After roll-up and compression, this becomes about 500 TB of Druid segments across roughly 400 nodes.

  2. Bulk insert API (e.g., for an initial load)?

Bulk inserts are done by generating segments with a Hadoop MapReduce job. See: http://druid.io/docs/latest/Batch-ingestion.html
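
For illustration, here is a rough sketch of what a Hadoop batch ingestion task spec looks like (the datasource name, dimensions, metrics, interval, and input path below are all made up; see the docs linked above for the authoritative format):

    {
      "type": "index_hadoop",
      "spec": {
        "dataSchema": {
          "dataSource": "events",
          "parser": {
            "type": "string",
            "parseSpec": {
              "format": "json",
              "timestampSpec": {"column": "timestamp", "format": "auto"},
              "dimensionsSpec": {"dimensions": ["country", "device"]}
            }
          },
          "metricsSpec": [
            {"type": "count", "name": "count"},
            {"type": "doubleSum", "name": "value_sum", "fieldName": "value"}
          ],
          "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "DAY",
            "queryGranularity": "HOUR",
            "intervals": ["2015-01-01/2015-02-01"]
          }
        },
        "ioConfig": {
          "type": "hadoop",
          "inputSpec": {"type": "static", "paths": "hdfs://namenode/data/events/*.json"}
        }
      }
    }

Re-running a task like this over the same intervals regenerates the segments for those intervals, which is also the mechanism behind the atomic replacement described in the next answer.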

  3. Support for idempotent inserts? I mean, if I insert the same data point several times (i.e., same timestamp, same dimensions), does the DB end up with one entry or several?

Right now we require deduping to be done outside of Druid. However, if you batch insert (via Hadoop indexing) a static set of data and regenerate a set of segments, Druid will atomically replace the old set of segments with the new set.

  4. Support for upserts? I mean, is it possible to update an existing data point?

This is supported, but it is an expensive operation. Appending data is cheap (Druid is designed for append-heavy workloads), but updating already-appended data requires reindexing the affected segments.

  5. Interpolation support? I mean, handling the aggregation of multiple time series that are not aligned.

This currently needs to be done at the client level.

  6. Availability of an AWS CloudFormation template or a set of Docker images to deploy a test environment?

We have a Docker script to set up a cluster locally.

https://github.com/druid-io/docker-druid

Our integration tests also use Docker to spin up a cluster.

  7. Support for stddev, derivative/rate, difference, … aggregate functions?

Not out of the box. There will be some coding you’ll have to do on your side to support some of these aggregations.

  8. Support for custom aggregate functions? I saw it is possible to write a JavaScript function, but can Druid also be extended through a Java plugin system?

Yes. You can write your own modules for custom aggregate functions, and the JavaScript aggregator can be used to prototype a new aggregator.
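
To sketch the prototyping route (the metric column "value" and the output name below are made up): a JavaScript aggregator like the following, placed in the "aggregations" array of a timeseries or groupBy query, accumulates a sum of squares; combined with a doubleSum and a count over the same column, that is enough to derive stddev on the client side.

    {
      "type": "javascript",
      "name": "sum_of_squares",
      "fieldNames": ["value"],
      "fnAggregate": "function(current, v) { return current + v * v; }",
      "fnCombine": "function(a, b) { return a + b; }",
      "fnReset": "function() { return 0; }"
    }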

  9. Support for custom batch processing? I mean, executing a map/reduce-style job (or similar) over a large subset of the data managed by Druid, independently of the deep storage used.

Druid bundles Hadoop-based batch indexing out of the box.

  10. An efficient Spark connector (if HDFS is used as deep storage) to compute complex analytics? I mean, directly using the segments stored in HDFS to batch-process the data with something like Spark or Impala.

Some folks in the community are working on this. FWIW, if you choose to use Druid, you may not need Impala.

  11. Full-text search on text columns? If not, is there a plugin for Elasticsearch?

This is not currently supported, but it is something we've been thinking about. Some folks out there run Druid and ES together.

  12. Support for metric discovery? I mean, access to a dictionary of the metric/event types existing in the system (with regex support).

There are endpoints for schema discovery, but regex is not currently supported.
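
For example (assuming a broker listening on localhost:8082; the datasource name and columns shown are made up, and the responses are abbreviated):

    GET http://localhost:8082/druid/v2/datasources
    ["events", ...]

    GET http://localhost:8082/druid/v2/datasources/events
    {"dimensions": ["country", "device"], "metrics": ["count", "value_sum"]}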

  13. Query UI?

None that are open source, although some will be open sourced soon.

  14. Admin/monitoring UI?

There’s a basic admin UI bundled with the Druid coordinator.

  15. Access control, and the granularity of that access control? I mean, controlling access to the database, or to a subset of it, for a set of users. By granularity, I mean access control on the whole DB or on a specific time series.

There is per-datasource control around data retention, but there is no current support for different sets of users (usually we see this handled on the client side).
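
On the retention side, a sketch of what the rules look like (the period is made up; rules are POSTed to the coordinator, e.g. at /druid/coordinator/v1/rules/{dataSource}): this chain keeps the most recent three months of a datasource loaded and drops everything older.

    [
      {"type": "loadByPeriod", "period": "P3M"},
      {"type": "dropForever"}
    ]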

  16. Commercial support and pricing?

Support is currently provided by the community.

  17. Number of core developers?

There are numerous engineers at Metamarkets, Yahoo, and other places who work on Druid full time. The developers are here:

Thanks a lot, Fangjin, for all these clear and precise answers.

By the way, if you have any pointers (people or repos) about integrating Druid with Spark, that would be very helpful, because I anticipate doing the same thing.

Laurent

Hi Laurent,

Himanshu has done some work on loading already-indexed Druid data through a Hadoop InputFormat. You may find this thread interesting: https://groups.google.com/forum/#!searchin/druid-development/inputformat/druid-development/EXYAPcV6pk4/Nu1uRGKEctAJ