limitSpec for groupBy with time bucket

  1. I would like to understand the behavior of limitSpec for a groupBy query with a time bucket.
    It appears that the limit is applied to the number of time buckets rather than to the dimension value combinations. Is this the expected behavior?

  2. I would like to apply the limit to the dimension keys, but there doesn't seem to be a simple way to do it.

Below is how I do it for now. Is there a better way?

Taking the metrics-kafka example, I want to get the top 10 hits by the "server" and "page" dimensions.

First, run a groupBy query with granularity "all" and limit 10:

{
  "queryType": "groupBy",
  "dataSource": "metrics-kafka",
  "granularity": "all",
  "dimensions": [
    "server",
    "page"
  ],
  "aggregations": [
    {
      "name": "hits",
      "type": "count"
    }
  ],
  "intervals": [
    "2016-11-01T00:00:00.000Z/2016-11-15T00:00:00.000Z"
  ],
  "limitSpec": {
    "type": "default",
    "limit": 10,
    "columns": [
      {
        "dimension": "hits",
        "direction": "descending"
      }
    ]
  }
}

And here is the result:

[
  { "version" : "v1", "timestamp" : "2016-11-01T00:00:00.000Z", "event" : { "hits" : 21, "server" : "www3.example.com", "page" : "/" } },
  { "version" : "v1", "timestamp" : "2016-11-01T00:00:00.000Z", "event" : { "hits" : 20, "server" : "www2.example.com", "page" : "/" } },
  { "version" : "v1", "timestamp" : "2016-11-01T00:00:00.000Z", "event" : { "hits" : 20, "server" : "www5.example.com", "page" : "/" } },
  { "version" : "v1", "timestamp" : "2016-11-01T00:00:00.000Z", "event" : { "hits" : 18, "server" : "www1.example.com", "page" : "/" } },
  { "version" : "v1", "timestamp" : "2016-11-01T00:00:00.000Z", "event" : { "hits" : 18, "server" : "www4.example.com", "page" : "/" } },
  { "version" : "v1", "timestamp" : "2016-11-01T00:00:00.000Z", "event" : { "hits" : 17, "server" : "www5.example.com", "page" : "/list" } },
  { "version" : "v1", "timestamp" : "2016-11-01T00:00:00.000Z", "event" : { "hits" : 16, "server" : "www1.example.com", "page" : "/list" } },
  { "version" : "v1", "timestamp" : "2016-11-01T00:00:00.000Z", "event" : { "hits" : 15, "server" : "www3.example.com", "page" : "/list" } },
  { "version" : "v1", "timestamp" : "2016-11-01T00:00:00.000Z", "event" : { "hits" : 15, "server" : "www4.example.com", "page" : "/list" } },
  { "version" : "v1", "timestamp" : "2016-11-01T00:00:00.000Z", "event" : { "hits" : 13, "server" : "www2.example.com", "page" : "/list" } }
]

Then build a filter from the above result (a Python sketch of assembling it programmatically appears after the query below), and run a groupBy query with period granularity and that filter:

{
  "queryType": "groupBy",
  "dataSource": "metrics-kafka",
  "granularity": {
    "type": "period",
    "period": "PT5M",
    "timeZone": "Etc/UTC"
  },
  "dimensions": [
    "server",
    "page"
  ],
  "aggregations": [
    {
      "name": "hits",
      "type": "count"
    }
  ],
  "intervals": [
    "2016-11-01T00:00:00.000Z/2016-11-15T00:00:00.000Z"
  ],
  "limitSpec": {
    "type": "default",
    "columns": [
      {
        "dimension": "hits",
        "direction": "descending"
      }
    ]
  },
  "filter": {
    "type": "or",
    "fields": [
      {
        "type": "and",
        "fields": [
          { "type": "selector", "dimension": "server", "value": "www1.example.com" },
          { "type": "selector", "dimension": "page", "value": "/" }
        ]
      },
      {
        "type": "and",
        "fields": [
          { "type": "selector", "dimension": "server", "value": "www1.example.com" },
          { "type": "selector", "dimension": "page", "value": "/list" }
        ]
      }
    ]
  }
}
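
Here is a rough Python sketch of how the OR filter above could be built from the first query's result; names like top_rows are purely illustrative, not part of any Druid API:

# Sketch: turn each server/page combination returned by the first
# ("all"-granularity) query into an AND of two selector filters,
# then OR them together for the second query. top_rows stands in for
# the parsed JSON result of the first query (only two rows shown).
top_rows = [
    {"event": {"hits": 21, "server": "www3.example.com", "page": "/"}},
    {"event": {"hits": 20, "server": "www2.example.com", "page": "/"}},
    # ... remaining rows from the first query's result
]

or_filter = {
    "type": "or",
    "fields": [
        {
            "type": "and",
            "fields": [
                {"type": "selector", "dimension": "server", "value": row["event"]["server"]},
                {"type": "selector", "dimension": "page", "value": row["event"]["page"]},
            ],
        }
        for row in top_rows
    ],
}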

Thanks!

Kenji

Hey Kenji,

groupBy behavior with granularity other than "all" is somewhat non-intuitive. See this issue: https://github.com/druid-io/druid/issues/1926. In 0.9.2, another approach is to stick with granularity "all" but include a "time" dimension using a timeFormat extraction function with "granularity" set. For example:

"dimensions": [
  {
    "type" : "extraction",
    "dimension" : "__time",
    "outputName" : "time_5min",
    "extractionFn" : {
      "type" : "timeFormat",
      "granularity" : {
        "type": "period",
        "period": "PT5M",
        "timeZone": "Etc/UTC"
      }
    }
  },
  "server",
  "page"
]

Then the limiting behavior will be more like what you expect.
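
Putting that together, the whole query would look roughly like this; just a sketch, shown as a Python dict POSTed to the broker, where the broker URL and the limit of 10 are placeholders taken from your example rather than anything required:

# Sketch: groupBy with granularity "all", a timeFormat extraction dimension
# bucketing __time into 5-minute periods, and a limitSpec that now ranks
# server/page/time_5min rows together. The broker address is an assumption;
# adjust it for your cluster.
import json
import requests  # assumes the `requests` library is installed

query = {
    "queryType": "groupBy",
    "dataSource": "metrics-kafka",
    "granularity": "all",
    "dimensions": [
        {
            "type": "extraction",
            "dimension": "__time",
            "outputName": "time_5min",
            "extractionFn": {
                "type": "timeFormat",
                "granularity": {"type": "period", "period": "PT5M", "timeZone": "Etc/UTC"},
            },
        },
        "server",
        "page",
    ],
    "aggregations": [{"name": "hits", "type": "count"}],
    "intervals": ["2016-11-01T00:00:00.000Z/2016-11-15T00:00:00.000Z"],
    "limitSpec": {
        "type": "default",
        "limit": 10,
        "columns": [{"dimension": "hits", "direction": "descending"}],
    },
}

response = requests.post("http://localhost:8082/druid/v2/", json=query)  # placeholder broker address
print(json.dumps(response.json(), indent=2))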

See 0.9.2-rc2 docs here for more details: http://druid.io/docs/0.9.2-rc2/querying/dimensionspecs.html (the final hasn’t been released yet)

Oh, wait, I might have misunderstood you. Are you saying you want to get a 5min timeseries of the top 10 server/page combos? Doing that with two queries, like you’re doing now, sounds like the best way to me.

Thanks, Gian.

Are you saying you want to get a 5min timeseries of the top 10 server/page combos?

Yes, that is what I'm trying to achieve. I've read #1926; I think it addresses a different issue, but your comments in that thread gave me insight into how limit / order works. The query result I'm trying to achieve is conceptually 3D data (metrics x dimensions x timeseries). It would be nice to be able to control which field the limit / order applies to.