Have a different aggregation for the time dimension than for other dimensions?

Hello,

I have been experimenting a bit with Druid and I have the following question: in a query, is it possible to specify a “granularity aggregator” (on the time dimension) that is different from the aggregator run against the other dimensions?

For instance, let’s say I have 2 servers, each one with a certain number of active connections to another system. This number of active connections reflects a state at a given time and can go up and down. Every minute, each server sends its number. In Druid I have this data (server is a dimension):

```
11:03 server=server1 12
11:03 server=server2 42
11:04 server=server1 24
11:04 server=server2 33
11:05 server=server1 19
11:05 server=server2 31
```

Let’s say I want to draw a graph showing the maximum number of active connections across servers, with an “hour” granularity (just an example; it could be “15min”, “day”, …).

If I run the query with “hour” as the granularity and “sum” as the aggregator, it gives me sum(12, 42, 24, 33, 19, 31) = 161, which doesn’t mean anything in my case (it makes no sense to sum successive values from the same server).

Instead I would like to issue a query with

  • “hour” as the granularity

  • “max” as the aggregator against the time dimension

  • “sum” as the aggregator against other dimensions (server here)

The query would do:

  • Apply the granularity aggregator (max) over each granularity period

```
11:00 server=server1 24
11:00 server=server2 42
```

  • Apply the dimensions aggregator (sum)

```
11:00 66
```

→ And thus I would get 66.
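The two rollup steps can be sketched in Python (just an illustration of the intended semantics on the sample data above, not anything Druid provides out of the box):

```python
from collections import defaultdict

# Sample minute-level data from above: (minute, server, active connections)
rows = [
    ("11:03", "server1", 12), ("11:03", "server2", 42),
    ("11:04", "server1", 24), ("11:04", "server2", 33),
    ("11:05", "server1", 19), ("11:05", "server2", 31),
]

# Step 1: roll up the time dimension to "hour" granularity with max,
# keeping one value per (hour, server) combination.
max_per_server = {}
for minute, server, conns in rows:
    hour = minute.split(":")[0] + ":00"
    key = (hour, server)
    max_per_server[key] = max(max_per_server.get(key, 0), conns)

# Step 2: roll up the "server" dimension with sum.
total_per_hour = defaultdict(int)
for (hour, _server), conns in max_per_server.items():
    total_per_hour[hour] += conns

print(dict(total_per_hour))  # {'11:00': 66}
```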

Does it make sense? Do you know how I could achieve this?

Of course I could do it “manually” by running a groupBy query on all dimensions with minute granularity and aggregating the results myself in post-processing, but that means transferring a lot of data to the caller (and probably hurts performance).

To illustrate differently, it seems that OpenTSDB calls this “downsampling”: http://opentsdb.net/docs/build/html/user_guide/query/index.html#downsampling

Thank you,

Thomas

Hi Thomas, I’m very confused by what you are trying to do. Can you describe how this is different from Druid’s concept of rollup?

Hello Fangjin,

Basically I would like to roll up the timestamp dimension using a “max” operator and roll up the other dimensions using a “sum” operator.

From what I understood, in a Druid query you specify one operator and it is used to do the rollup on all dimensions (timestamp and others).

For instance

```json
{
  "queryType": "groupBy",
  "dataSource": "XXX",
  "granularity": "hour",
  "aggregations": [ {"type": "longSum", "name": "metricAgg", "fieldName": "metric"} ],
  "intervals": [ "XXX" ]
}
```

will sum all the points in the same hour interval (so a sum on the timestamp dimension to get “hour” granularity, and a sum on the other dimensions because they are not listed in “dimensions”).

What I would like is something like

```json
{
  "queryType": "groupBy",
  "dataSource": "XXX",
  "granularity": "hour",
  "granularityAggregation": "longMax",
  "aggregations": [ {"type": "longSum", "name": "metricAgg", "fieldName": "metric"} ],
  "intervals": [ "XXX" ]
}
```

which would

  1. Roll up to “hour” granularity by taking the max for each dimension combination (in my example above this results in 2 data points: 11:00 server=server1 24 and 11:00 server=server2 42)
  2. Roll up the dimensions not in the query’s “dimensions” list by doing the sum (in my example, roll up the “server” dimension, so just one data point in the end: 11:00 66)

To do this in OpenTSDB, I would run

```
/api/query?start=48h-ago&m=sum:1h-max:metric
```

Thanks,

Thomas

Hi Thomas,

If I understand correctly, you have rows with a “server” dimension and a “num_connections” metric representing active connections at a given instant, and you want to:

  • Find the max “num_connections” per-server in each time bucket

  • Sum these max “num_connections” values across servers per time bucket

Would the following query be suitable for your use case?

inner query:

```json
{
  "queryType": "groupBy",
  "dataSource": "XXX",
  "granularity": "hour",
  "dimensions": ["server"],
  "aggregations": [ {"type": "longMax", "name": "max_connections_per_server", "fieldName": "num_connections"} ],
  "intervals": [ "XXX" ]
}
```

outer query:

```json
{
  "queryType": "groupBy",
  "dataSource": { "type": "query", "query": <insert inner query from above here> },
  "granularity": "hour",
  "dimensions": [],
  "aggregations": [ {"type": "longSum", "name": "max_connections_total", "fieldName": "max_connections_per_server"} ],
  "intervals": [ "XXX" ]
}
```
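Put together, the two specs would form one nested query like the following (a sketch; the “XXX” dataSource and intervals are placeholders kept from above):

```json
{
  "queryType": "groupBy",
  "dataSource": {
    "type": "query",
    "query": {
      "queryType": "groupBy",
      "dataSource": "XXX",
      "granularity": "hour",
      "dimensions": ["server"],
      "aggregations": [
        {"type": "longMax", "name": "max_connections_per_server", "fieldName": "num_connections"}
      ],
      "intervals": ["XXX"]
    }
  },
  "granularity": "hour",
  "dimensions": [],
  "aggregations": [
    {"type": "longSum", "name": "max_connections_total", "fieldName": "max_connections_per_server"}
  ],
  "intervals": ["XXX"]
}
```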

You may also be interested in checking out PlyQL for expressing queries in SQL:

https://github.com/implydata/plyql

Thanks,

Jon