Calculate total sum of topN/groupBy aggregations

Hi,

Let’s say we have a druid datasource: { dimensions: [name], metrics: [amount] }

I want to calculate the sum amount of the top N = 100 names (ordered by sum amount).

Normally, I do like this:

  1. Execute a topN/groupBy query with dimension = name to get the top 100 results ordered by sum of amount.

  2. Sum those per-name amount values myself.
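For reference, step 1 can be sketched as a Druid topN query like the one below. The datasource name, interval, and aggregator names are assumptions for illustration:

```json
{
  "queryType": "topN",
  "dataSource": "dataSource",
  "granularity": "all",
  "dimension": "name",
  "metric": "sum_amount",
  "threshold": 100,
  "aggregations": [
    { "type": "doubleSum", "name": "sum_amount", "fieldName": "amount" }
  ],
  "intervals": ["2016-01-01/2016-02-01"]
}
```

Step 2 is then just summing the `sum_amount` values in the returned result set on the client side.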

The execution becomes slow if I increase 100 to 1,000,000. Is there any other way of doing this? Thanks

Hi,

What is the SQL query you are trying to issue?

Hi Fangjin,

The corresponding SQL query is something like this:

SELECT SUM(sum_amount)

FROM

(
SELECT name, SUM(amount) AS sum_amount
FROM dataSource
GROUP BY name
ORDER BY sum_amount DESC
LIMIT 100
) AS t;

It gets really slow if I increase N from 100 to 1,000,000.

Hi,

Have you looked into using https://github.com/implydata/plyql and letting it plan and issue the query for you?

– FJ

Thanks Fangjin,

I’ve tried plyql and found that this query works:

SELECT (SELECT SUM (traffic) as sum_amount GROUP BY amount ORDER BY sum_amount DESC LIMIT 100) as subTotal FROM dataSource


It retrieves the top 100 subTotal values. However, what I want to achieve is to calculate the SUM of those subTotal values.

I tried this query but it said invalid syntax:

SELECT SUM(subTotal) FROM((SELECT SUM (traffic) as sum_amount GROUP BY amount ORDER BY sum_amount DESC LIMIT 100) as subTotal FROM dataSource)


In other words, I want to execute a SUM of values generated from a GROUP BY query.

Thanks

If you’re asking for top 1M items you’re probably not doing a topN anymore and are more likely doing a “download all the data” kind of query.
Internally we currently use lexicographic topN to paginate through the potential values and get the results.

For a simple sum of an amount over some interval of time, timeseries is a good option.

If you are just wanting the sum of the topN, then you can do two queries. One to get the topN results, and a second timeseries query with a filter on the dimension of the values of interest.
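A sketch of that second query, assuming the topN step returned names such as "name1" and "name2" (hypothetical values) and a Druid version with the "in" filter (on older versions, an "or" of selector filters does the same thing):

```json
{
  "queryType": "timeseries",
  "dataSource": "dataSource",
  "granularity": "all",
  "filter": {
    "type": "in",
    "dimension": "name",
    "values": ["name1", "name2"]
  },
  "aggregations": [
    { "type": "doubleSum", "name": "total", "fieldName": "amount" }
  ],
  "intervals": ["2016-01-01/2016-02-01"]
}
```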

Thanks Charles,

Are there other ways of doing this? For example, using histogram / quantile calculation? In my case, speed matters more than accuracy.

Best Regards

Have you tried using nested groupBys?
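A nested groupBy along these lines might look like the sketch below: the inner groupBy produces the top 100 per-name sums, and the outer groupBy sums them into a single value. The interval and limitSpec details are assumptions for illustration:

```json
{
  "queryType": "groupBy",
  "dataSource": {
    "type": "query",
    "query": {
      "queryType": "groupBy",
      "dataSource": "dataSource",
      "granularity": "all",
      "dimensions": ["name"],
      "aggregations": [
        { "type": "doubleSum", "name": "sum_amount", "fieldName": "amount" }
      ],
      "limitSpec": {
        "type": "default",
        "limit": 100,
        "columns": [
          { "dimension": "sum_amount", "direction": "descending" }
        ]
      },
      "intervals": ["2016-01-01/2016-02-01"]
    }
  },
  "granularity": "all",
  "dimensions": [],
  "aggregations": [
    { "type": "doubleSum", "name": "total", "fieldName": "sum_amount" }
  ],
  "intervals": ["2016-01-01/2016-02-01"]
}
```

This keeps the whole computation in a single query, at the cost of the inner groupBy materializing all the per-name rows before the limit is applied.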