Comments about concurrent queries in Druid

Hi,

I am trying to design a system that has the following requirements

  1. 100 Concurrent queries (mostly Historic)

  2. Ingest and Store ~1 million events per day

  3. Query responses must be under a second and queries are mostly aggregate / filter queries

Questions:

  1. What’s the recommended deep storage system for this system? (I’d prefer S3 to avoid management overheads)

  2. How many Druid cluster nodes should I have in the cluster?

Thanks,

Ranjit

Hi Ranjit, see inline.

Hi,

I am trying to design a system that has the following requirements

  1. 100 Concurrent queries (mostly Historic)
  2. Ingest and Store ~1 million events per day
  3. Query responses must be under a second and queries are mostly aggregate / filter queries

If it helps your decision making, we are handling about 100 concurrent queries per second in production right now and ingesting about 1 million events (usually <100 dimensions, <50 metrics) every 2 seconds, so Druid should be able to scale to your needs.
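To put those two ingestion rates side by side, here is a rough back-of-the-envelope sketch. The function name is made up for illustration, and it assumes the production rate of 1 million events every 2 seconds is sustained around the clock, which may not be the case in practice.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def events_per_day(events: int, every_seconds: float) -> float:
    """Scale a rate of `events` per `every_seconds` up to a full day."""
    return events * SECONDS_PER_DAY / every_seconds


# Production rate quoted above: ~1 million events every 2 seconds
# (assumed constant for the whole day).
production = events_per_day(1_000_000, 2)

# Requirement from the question: ~1 million events per day.
required = 1_000_000

headroom = production / required
print(f"production: {production:,.0f} events/day")
print(f"required:   {required:,} events/day")
print(f"ratio:      {headroom:,.0f}x")
```

Under that assumption the required ingestion volume is several orders of magnitude below the production figure.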

Questions:

  1. What’s the recommended deep storage system for this system? (I’d prefer S3 to avoid management overheads)

Deep storage is just a permanent backup of data and is not involved in queries. If you have HDFS available, that is a popular option for deep storage.

  2. How many Druid cluster nodes should I have in the cluster?

It depends on the type of hardware that you have. Given your relatively low volume of data, you’ll likely generate 1 segment per day. You can probably get away with a very minimal setup and combine services on the same node. What type of hardware do you have access to?

S3 and HDFS are the most battle-tested deep storages, and I think you’d be happy with either one. The size of your Druid cluster is going to depend mostly on how much historical data you want to have available.
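To get a feel for how retention drives cluster size, a rough estimate along these lines may help. This is only a sketch: the function name, the 100 bytes per stored event, and the replication factor of 2 are placeholder assumptions, not measured values. Actual segment size depends heavily on your dimensions, metrics, and rollup, so measure it on a sample of your own data.

```python
def estimate_storage_gb(events_per_day: int,
                        retention_days: int,
                        bytes_per_event: int,
                        replication: int = 2) -> float:
    """Rough size of the data the historical nodes must hold, in GB.

    bytes_per_event and replication are assumptions to replace with
    numbers measured on your own cluster.
    """
    total_bytes = events_per_day * retention_days * bytes_per_event * replication
    return total_bytes / 1e9


# ~1 million events/day (from the question), one year of history kept
# queryable, with an assumed 100 bytes per event after indexing.
one_year = estimate_storage_gb(1_000_000, 365, 100)
print(f"~{one_year:.0f} GB for one year of history")
```

If the number comes out in the tens of GB, a small setup should be plenty. Rerun it with your own retention and per-event size before settling on hardware.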