Nested GroupBy

I’m trying to do a nested groupBy and I’m a little confused about using a subquery result as the data source. For a table data source, I specify the metricsSpec at ingestion time, and I can then refer to those metrics when I aggregate in a query later. But when I use the result of a subquery as the data source, how do I specify the metricsSpec for the subquery result? Or more generally, is there any example of a nested groupBy query? Thanks!
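For reference, this is the kind of metricsSpec I mean, from a typical ingestion spec (the field names here are just placeholders):

    "metricsSpec": [
      { "type": "count", "name": "count" },
      { "type": "longSum", "name": "value", "fieldName": "value" }
    ]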

Hi Qi,

We are lacking some documentation in this area. Your best bet is this documentation page: http://druid.io/docs/latest/DataSource.html, which describes how to nest an entire groupBy within another groupBy. I am also curious about your use case, because I think you may be able to get similar answers using iterated topNs, which should be significantly faster. You can also take a look at the facet.js open source library (http://facetjs.org/), which you can use to issue iterated topNs (topNs of topNs). Let me know if this makes sense.
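To give a rough idea of the iterated pattern, here is a minimal sketch (the dimension and metric names are placeholders, not from your schema). First issue a topN on the outer dimension:

    {
      "queryType": "topN",
      "dataSource": "data_table",
      "granularity": "all",
      "dimension": "dim1",
      "threshold": 100,
      "metric": "value",
      "aggregations": [
        { "type": "longSum", "name": "value", "fieldName": "count" }
      ],
      "intervals": [ "2012-01-01T00:00:00.000/2020-01-03T00:00:00.000" ]
    }

Then, for each dim1 value returned, issue the same query with "dimension": "dim2" and a filter such as { "type": "selector", "dimension": "dim1", "value": "<that value>" }.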

– FJ

Thanks for the reply. If I’m going to do a topN of a topN, I still have the problem of using the first topN query result as the dataSource of the second topN query, right? So should I ingest the first topN query result into Druid first and then issue the second topN query? Or is there a way to do nested topNs?

For example, if I have a data table with the schema
dim1, dim2, dim3, dim4

and what I want to do is

SELECT
  dim1,
  dim2,
  COALESCE(SUM(value), 0) AS sum,
  COALESCE(SUM(value * value), 0) AS square_sum
FROM (
  SELECT
    dim1,
    dim2,
    dim3,
    COUNT(*) AS value
  FROM data_table
  GROUP BY dim1, dim2, dim3
) t
GROUP BY dim1, dim2;

  1. How can I solve this with a topN of topN?

  2. I also notice that I can write custom aggregation functions in the query JSON to solve this problem, but I’m not sure whether there will be any performance issues.

Thanks!

The easiest way to do nested topNs is to look at the facet.js library. It has built-in support for them out of the box and is generally easier to work with than Druid’s native query language.

You can write custom JavaScript aggregators. This is useful for prototyping, and that is what we use them for. A native Java aggregator is about twice as fast as the equivalent JavaScript aggregator.
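For example, a minimal sketch of a JavaScript aggregator spec that sums squares (the "value" field name is a placeholder for whatever metric you are aggregating):

    { "type": "javascript",
      "name": "sum_of_squares",
      "fieldNames": ["value"],
      "fnAggregate": "function(current, v) { return current + v * v; }",
      "fnCombine": "function(a, b) { return a + b; }",
      "fnReset": "function() { return 0; }"
    }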

I see. Thanks! Will investigate those.

Hi Fangjin,

Just want to report a small issue. When I was doing a nested groupBy like the one below, Druid returned an error saying “[value] already defined”, and if I remove either of the two aggregation functions in the outer query, the other one works fine. This does not make sense to me, since in some use cases we need to use the same column in different aggregation functions.

{
  "queryType": "groupBy",
  "dataSource": {
    "type": "query",
    "query": {
      "queryType": "groupBy",
      "dataSource": "platform",
      "granularity": "all",
      "dimensions": ["treatment_name", "subject_id", "dim_app_family"],
      "aggregations": [
        { "type": "longSum", "name": "value", "fieldName": "count" }
      ],
      "intervals": [ "2012-01-01T00:00:00.000/2020-01-03T00:00:00.000" ]
    }
  },
  "granularity": "all",
  "dimensions": ["treatment_name", "dim_app_family"],
  "aggregations": [
    { "type": "longSum", "name": "first_moment", "fieldName": "value" },
    { "type": "javascript",
      "name": "second_moment",
      "fieldNames": ["value"],
      "fnAggregate": "function(current, v) { return current + v * v; }",
      "fnCombine": "function(partialA, partialB) { return partialA + partialB; }",
      "fnReset": "function() { return 0; }"
    }
  ],
  "intervals": [ "2012-01-01T00:00:00.000/2020-01-03T00:00:00.000" ]
}

Hi Qi,

Instead of a nested groupBy, can you run a groupBy over dim1, dim2 with a cardinality aggregator over dim3?
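Something along these lines (a sketch in the dim1/dim2/dim3 terms of your SQL; adapt the names to your schema):

    {
      "queryType": "groupBy",
      "dataSource": "data_table",
      "granularity": "all",
      "dimensions": ["dim1", "dim2"],
      "aggregations": [
        { "type": "cardinality", "name": "dim3_count", "fieldNames": ["dim3"], "byRow": false }
      ],
      "intervals": [ "2012-01-01T00:00:00.000/2020-01-03T00:00:00.000" ]
    }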

As for the error seen in the nested groupBy query, it definitely looks like a bug. Can you file a GitHub issue for it with the complete stack trace of the exception?

Yeah, looks like a bug. I’ve filed an issue with the likely cause.

Qi, please add any more details you might have.

– Himanshu

Hi Nishant,

Thanks for the reply. I tried running a groupBy over dim1, dim2 with a cardinality aggregator over dim3 (the query is below), but the result does not seem to be correct. I notice the documentation says:

“Cardinality aggregator

Computes the cardinality of a set of Druid dimensions, using HyperLogLog to estimate the cardinality.”

So am I getting an approximate result by using the cardinality aggregator?

Thanks,

Qi

{
  "queryType": "groupBy",
  "dataSource": "platform",
  "granularity": "all",
  "dimensions": ["treatment_name", "dim_app_family"],
  "aggregations": [
    { "type": "cardinality",
      "name": "first_moment",
      "fieldNames": [ "subject_id" ],
      "byRow": true
    }
  ],
  "intervals": [ "2012-01-01T00:00:00.000/2020-01-03T00:00:00.000" ]
}

The result will be approximate, but should be faster than calculating the exact result. It should generally be within a few low percentage points of the actual value.
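For a rough sense of scale, the standard error of a HyperLogLog estimate with m registers is about 1.04 / sqrt(m). Assuming, purely for illustration, m = 2048 registers, that works out to 1.04 / sqrt(2048) ≈ 2.3%.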

Hi Fangjin,

I see, thanks. But in my case, the output looks like a random number.

Some examples:

output value          actual value
898335.8585119019     698030
19664.608641596507    1829621

I think the query may be incorrect, but I’d be curious what the equivalent SQL query is that you are trying to issue.

Also, if you issue the following query, how does it compare to the actual unique number of combinations for treatment_name and dim_app_family?

{
  "queryType": "timeseries",
  "dataSource": "platform",
  "granularity": "all",
  "aggregations": [
    { "type": "cardinality",
      "name": "first_moment",
      "fieldNames": ["treatment_name", "dim_app_family"],
      "byRow": true
    }
  ],
  "intervals": [ "2012-01-01T00:00:00.000/2020-01-03T00:00:00.000" ]
}

  1. The equivalent Druid query is the nested groupBy I posted earlier in this thread.
  2. I queried the same data set on Hive for reference. I also issued the following query to Druid, and the output is very close to Hive’s (not exactly the same, so I guess there is also approximation for nested groupBy?).

Thanks.

{
  "queryType": "groupBy",
  "dataSource": {
    "type": "query",
    "query": {
      "queryType": "groupBy",
      "dataSource": "platform",
      "granularity": "all",
      "dimensions": ["treatment_name", "subject_id", "dim_app_family"],
      "aggregations": [
        { "type": "longSum", "name": "value", "fieldName": "count" }
      ],
      "intervals": [ "2012-01-01T00:00:00.000/2020-01-03T00:00:00.000" ]
    }
  },
  "granularity": "all",
  "dimensions": ["treatment_name", "dim_app_family"],
  "aggregations": [
    { "type": "longSum", "name": "first_moment", "fieldName": "value" }
  ],
  "intervals": [ "2012-01-01T00:00:00.000/2020-01-03T00:00:00.000" ]
}

Hi Qi, can you provide the SQL query?

Hi Qi,

Do you mind sharing the full Druid query that generated those numbers as well as the query you used to compute the expected values?

That would help us figure out what might be going wrong.

Hi Fangjin,

I sent a private message to you.

Hi Xavier,

I sent a private message to you about the SQL query. Thanks!

Hi Wang,
I tried this query for my use case, but it’s returning an empty array for me.

Hi Yang,
Have you found a solution for this? I tried the queries shared by Wang, but they return an empty result.