How to combine GroupBy And TopN?

When we do time series query, it is normal that it will return lots of time series data after groupBy(i.e. cpu usage group by host). And often we want to select the TopN from these series, and the sort rule is not depend a single dimension, it depends the whole value of these series(i.e. the highest 5 average of cpu usage). If we can do this, we can answer a lot valuable question, like, what’s the highest 5 CPU average usage during the past 4 hour. How many host’s incoming bandwidth run upper than 120MB/s.
Of course these can be done in front end of the engine, but we think it’s much more efficient if the query engine can support these queries.


Hey Kurt,

For some queries like this, you can use nested groupBys, which work by distributing the inner query, collecting the results on the broker, and then doing the outer query on the broker. But I think that nested groupBys wouldn’t be able to do the specific query you’re asking for. I think the best way to do that is to do a time-granular groupBy to get the various timeseries, and then do the final sort step on the client side.

Hi Gian,
We have done the thing you suggested, and our client side is a web page. Normally this works fine, but if the group number is huge, like one million, the large memory need to store all this data would kill the performance. This involves server side to do some “partition” and “top” calculation. I think this is a very interesting and common problem for time series query, an advanced OLAP query, it’s very nice if druid could support this.

I do agree that it’d be nice for Druid to be able to do things like this in the future.