groupBy with limitSpec question

[Moving the question from the Druid Development group: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/druid-development/kO8KqAzbj-s]

Hi All,

It looks like a groupBy with a limitSpec, when queried across multiple days, returns results from only the first day in the interval (we are querying 03/17 to 03/23 for 10,000 records in descending order by a metric, but we see results only for 03/17). The expected behavior is that the order and limit should be applied across all days. Can someone help us understand why this is happening?

"intervals": ["2015-03-17T08:00:00.000Z/2015-03-23T07:59:59.999Z"],
"limitSpec": {
    "type": "default",
    "limit": 10000,
    "columns": [{
        "dimension": "impressions",
        "direction": "descending"
    }]
}

Query:

{
    "dataSource": "marketing-rollup-day",
    "queryType": "groupBy",
    "granularity": {
        "type": "period",
        "period": "P1D",
        "timeZone": "-08:00"
    },
    "dimensions": ["e", "ercc", "dvis", "page", "greg", "gccd", "gcit", "dvid", "btyp", "bver"],
    "filter": {
        "type": "selector",
        "dimension": "channel",
        "value": "dp"
    },
    "threshold": 0,
    "aggregations": [
        { "type": "longSum", "name": "req", "fieldName": "rq" },
        { "type": "longSum", "name": "impressions", "fieldName": "im" },
        { "type": "longSum", "name": "clicks", "fieldName": "cl" },
        { "type": "longSum", "name": "os", "fieldName": "os" },
        { "type": "longSum", "name": "conversions", "fieldName": "cv" },
        { "type": "longSum", "name": "roi", "fieldName": "roi" },
        { "type": "longSum", "name": "dismisses", "fieldName": "di" }
    ],
    "intervals": ["2015-03-17T08:00:00.000Z/2015-03-23T07:59:59.999Z"],
    "limitSpec": {
        "type": "default",
        "limit": 10000,
        "columns": [{
            "dimension": "impressions",
            "direction": "descending"
        }]
    }
}

We are using Druid 0.6.160. Could this be a bug in the code?

Hi Venkatesh, can you confirm this behavior with the latest code in master?

Fangjin, by latest code do you mean 0.7.0? If so, can we just try a 0.7.0 broker against the rest of the services on 0.6.160?

Hi Venkatesh,

By "latest code in master" I believe FJ means the current code in the master branch on GitHub, or the 0.7.1-RC1 release.

I suspect that a 0.7.1-RC1 broker might not be backwards compatible with 0.6.160 historicals; this would require upgrading both your historicals and brokers.

Hi Nishant,

I tried with 0.7.0 and the limitSpec is still limiting the results sorted within a day instead of sorting across the date range specified in the query.

Before 0.7.0, I tried 0.7.1-RC1, but the index task itself had a problem with our data/use case. I posted about it here before trying 0.7.0: https://groups.google.com/forum/#!topic/druid-user/g6pfDw8WR1k

Hi Venkatesh, we suspect this is a bug on our side. Is it possible to contribute a unit test to help us find the error faster?

GroupByQueryRunnerTest.java is a good place to start for some example unit tests

Hi Fangjin,

I am trying to write a unit test that would surface this error.

I am looking at the existing unit test GroupByQueryRunnerTest.testGroupByWithOrderLimit4 for reference, and I am trying to figure out how the datasource QueryRunnerTestHelper.dataSource is mocked. Can you help me figure out where the rows for this datasource are mocked? I looked into the resource files too, but no luck.

Hi Venkatesh,

The test data used is in:

resources/druid.sample.tsv

That data is loaded up for unit tests and queries run against that data.

One way to reproduce your error is to change:

.setGranularity(QueryRunnerTestHelper.allGran)

to another granularity and see if the results that come back make sense.

Hi Fangjin,

I have created a pull request with a unit test that surfaces this bug.

Can you please take a look at it?

Thanks Venkatesh, I’ll continue the discussion in the pull request.

Hi Venkatesh, I took a closer look at your issue based on your PR (thanks for the contrib!) and have discussed with a few other committers.

The current way LIMIT functions is the same as in SQL and LIMIT restricts the number of rows returned in the overall result.

I believe what you are looking for is a limit per result time bucket, which will require some development work in Druid and require some new syntax.

As a workaround, you can issue multiple queries to get the results you require, or, if you are building visualizations on top of Druid, take a look at the open-source http://facetjs.org/ library, which I believe supports your request out of the box.
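To make the multi-query workaround concrete, here is a minimal sketch: issue one groupBy per day, then merge the per-day results client-side, re-sort by the metric, and apply the global limit once. The per-day rows below are made-up stand-ins for what a broker would return for each daily interval.

```python
# Sketch of the multi-query workaround: merge results from several
# single-day queries, then sort and limit across all of them at once.

def merge_and_limit(per_day_results, metric, limit):
    """Combine rows from several single-day queries, sort descending by
    the given metric, and keep only the top `limit` rows overall."""
    all_rows = [row for day in per_day_results for row in day]
    all_rows.sort(key=lambda r: r[metric], reverse=True)
    return all_rows[:limit]

# Hypothetical results from two separate one-day queries.
day1 = [{"page": "a", "impressions": 50}, {"page": "b", "impressions": 10}]
day2 = [{"page": "c", "impressions": 70}, {"page": "d", "impressions": 5}]

top = merge_and_limit([day1, day2], "impressions", 3)
print([r["page"] for r in top])  # pages with the highest impressions overall
```

The cost is one query round-trip per day plus a client-side sort, but it gives the "limit across all days" semantics the original query expected.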

Hi Fangjin, thanks for looking into this. I will take a look at facetjs.org.

BTW, just to make sure, regarding "The current way LIMIT functions is the same as in SQL and LIMIT restricts the number of rows returned in the overall result": when we have a limitSpec with a sort on a metric, the results are being restricted to the first time bucket, not taken from the overall sorted results.
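To spell out the distinction Venkatesh is drawing, here is a small illustration with made-up rows: the observed behavior sorts and limits only the first time bucket's rows, while the expected behavior sorts and limits across every bucket in the interval.

```python
# Observed vs. expected limitSpec behavior, with fabricated rows.

rows = [
    {"day": "2015-03-17", "impressions": 30},
    {"day": "2015-03-17", "impressions": 20},
    {"day": "2015-03-18", "impressions": 90},
    {"day": "2015-03-18", "impressions": 40},
]

limit = 2

# Observed: sort/limit applied only to the first bucket's rows.
first_bucket = [r for r in rows if r["day"] == "2015-03-17"]
observed = sorted(first_bucket, key=lambda r: r["impressions"], reverse=True)[:limit]

# Expected: sort/limit applied across every bucket in the interval.
expected = sorted(rows, key=lambda r: r["impressions"], reverse=True)[:limit]

print([r["impressions"] for r in observed])  # [30, 20]
print([r["impressions"] for r in expected])  # [90, 40]
```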

Hi Fangjin,

Adding to my last comment: is it possible to get the limit working on top of the sorted results (sorted across all the intervals) without much development work?

Thanks,

Venkatesh

Hi Venkatesh, perhaps I am misunderstanding your requirements, but I think you are describing the use case in https://github.com/druid-io/druid/pull/1296/files, which is limiting the results across multiple days.

Shouldn’t limiting by a specific time bucket be done by default just by setting the time granularity differently?

Could this not just be a case of the bug Bingkun recently sent a PR for, where groupBy applies its having/limit spec locally on historicals instead of waiting until the results are actually aggregated together on the broker?

–Eric
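A sketch of the failure mode Eric describes, with made-up partial sums: if each historical applies the limit to its own partial longSum rows before the broker merges them, a group key whose total is large but whose per-node partials are small can be dropped or under-counted.

```python
# Why limiting on historicals before the broker merge is wrong:
# group "x" has the largest total (70), but its small partial on
# node A (10) gets trimmed away by the local limit.

def local_then_merge(partials, limit):
    # Buggy order: each node limits its own partial sums, then the
    # broker merges whatever survived.
    trimmed = []
    for p in partials:
        top = sorted(p.items(), key=lambda kv: kv[1], reverse=True)[:limit]
        trimmed.append(dict(top))
    merged = {}
    for p in trimmed:
        for key, value in p.items():
            merged[key] = merged.get(key, 0) + value
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:limit]

def merge_then_limit(partials, limit):
    # Correct order: merge all partial sums on the broker, then limit once.
    merged = {}
    for p in partials:
        for key, value in p.items():
            merged[key] = merged.get(key, 0) + value
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:limit]

node_a = {"x": 10, "y": 50, "z": 52}   # partial longSums per group key
node_b = {"x": 60, "y": 5}

print(local_then_merge([node_a, node_b], 2))  # [('x', 60), ('y', 55)] -- x under-counted
print(merge_then_limit([node_a, node_b], 2))  # [('x', 70), ('y', 55)]
```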