I have an idea to put batch data and streaming data into Druid. The data is in the GB scale. Can anyone tell me: if I use Druid to run complicated SQL statements such as GROUP BY and ORDER BY, how fast will the queries be? Can I get the query result in a second or less?
You can use SQL on Druid to run GROUP BY and ORDER BY queries. If your cluster is sized appropriately for the amount of data and the number of concurrent users, sub-second queries are absolutely achievable in most cases.
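For reference, a minimal sketch of what such a query could look like sent to Druid's SQL HTTP endpoint (`/druid/v2/sql/`). The datasource name `events`, the column names, and the router address are assumptions for illustration:

```python
import json
import urllib.request

# A typical GROUP BY / ORDER BY query in Druid SQL.
# "events" and "channel" are hypothetical names for this sketch.
query = """
SELECT channel, COUNT(*) AS cnt
FROM events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '30' DAY
GROUP BY channel
ORDER BY cnt DESC
LIMIT 10
"""

# Druid accepts SQL as JSON posted to its SQL endpoint.
payload = json.dumps({"query": query}).encode("utf-8")
request = urllib.request.Request(
    "http://localhost:8888/druid/v2/sql/",  # assumed router address
    data=payload,
    headers={"Content-Type": "application/json"},
)
# rows = json.load(urllib.request.urlopen(request))  # run against a live cluster
```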
So, what is the matching rule between the amount of data and the size of the cluster? Any official suggestions or experience? Thanks!
There are many variables that impact overall performance. In general, you want a cluster with at least enough historicals to fit your most frequently queried data entirely in memory. So, if you have 10 GB of data per day, and a typical query spans 30 days of data, I would recommend having at least 300 GB of RAM across all of your historicals (5 nodes with 64 GB each will perform slightly better than 10 nodes with 32 GB each).
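The arithmetic above can be sketched as a quick back-of-the-envelope sizing helper. The numbers are just the example figures from this thread, not official guidance:

```python
# Rough historical-memory sizing estimate, using the example numbers above.
# All figures are illustrative; measure your own segment sizes in practice.

def ram_needed_gb(data_per_day_gb: float, query_window_days: int) -> float:
    """RAM needed to keep the typically-queried window fully in memory."""
    return data_per_day_gb * query_window_days

def historicals_needed(total_ram_gb: float, ram_per_node_gb: int) -> int:
    """Number of historical nodes to cover the working set (rounds up)."""
    return -(-int(total_ram_gb) // ram_per_node_gb)  # ceiling division

total = ram_needed_gb(10, 30)         # 10 GB/day, 30-day queries
print(total)                          # -> 300.0
print(historicals_needed(total, 64))  # -> 5 nodes at 64 GB each
print(historicals_needed(total, 32))  # -> 10 nodes at 32 GB each
```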
The good news is that Druid scales very easily, so if you start with 5 historicals and aren’t getting the performance you want, adding nodes is very simple and all rebalancing of data is done automatically.
Also keep in mind that data size in Druid is often much smaller than the raw data size, so you will want to run a small-scale test to see how much your data compresses. This is especially important if you intend to do roll-up.
Yes, it should work well. Give it a try!
Thanks & Rgds