TopN query results seem to be wrong!

I'm trying to implement a simple query to calculate the trend of certain articles, based on the difference in click counts between the previous and the current hour.

When querying older data (served by the Historicals) the results seem plausible, but when querying recent data (served by the MiddleManager) the results seem to ignore chunks of data and are not usable for this kind of trend calculation.

```json
{
  "queryType": "topN",
  "dataSource": "pageviews",
  "dimension": "onlineId",
  "metric": "score",
  "granularity": {"type": "duration", "duration": 7200000, "origin": "2018-09-24T08:22:00.000Z"},
  "threshold": 100,
  "intervals": ["2018-09-24T08:22:00.000Z/PT2H"],
  "aggregations": [
    {
      "type": "filtered",
      "filter": {
        "type": "interval",
        "dimension": "__time",
        "intervals": ["2018-09-24T08:22:00.000Z/PT1H"]
      },
      "aggregator": {
        "type": "longSum",
        "name": "total_old",
        "fieldName": "count"
      }
    },
    {
      "type": "filtered",
      "filter": {
        "type": "interval",
        "dimension": "__time",
        "intervals": ["2018-09-24T09:22:00.000Z/PT1H"]
      },
      "aggregator": {
        "type": "longSum",
        "name": "total_new",
        "fieldName": "count"
      }
    }
  ],
  "postAggregations": [
    {
      "type": "arithmetic",
      "name": "score",
      "fn": "-",
      "fields": [
        {"type": "fieldAccess", "fieldName": "total_new"},
        {"type": "fieldAccess", "fieldName": "total_old"}
      ]
    }
  ]
}
```

This query is sorted by the metric "score", and the result is:

```json
"result": [
  {
    "total_old": 0,
    "onlineId": "10264391",
    "total_new": 12364,
    "score": 12364
  },
  {
    "total_old": 1016,
    "onlineId": "10267880",
    "total_new": 9946,
    "score": 8930
  },
  {
    "total_old": 0,
    "onlineId": "10266584",
    "total_new": 2404,
    "score": 2404
  },
  {
    "total_old": 0,
    "onlineId": "10147655",
    "total_new": 2313,
    "score": 2313
  }
]
```

But when sorting by "total_new", the results differ completely:

```json
[
  {
    "timestamp": "2018-09-24T08:22:00.000Z",
    "result": [
      {
        "total_old": 14436,
        "onlineId": "10264391",
        "total_new": 13265,
        "score": -1171
      },
      {
        "total_old": 13058,
        "onlineId": "10267880",
        "total_new": 9946,
        "score": -3112
      },
      {
        "total_old": 3297,
        "onlineId": "10147655",
        "total_new": 2529,
        "score": -768
      },
      {
        "total_old": 2663,
        "onlineId": "10266584",
        "total_new": 2518,
        "score": -145
      }
    ]
  }
]
```

I'm using Druid version 0.12.3 with queryGranularity = none and segmentGranularity = day.

When using groupBy instead of topN, the results seem to be correct. Does topN make sense in this context (to improve query time)? Or is it possible to calculate this at ingestion time somehow?
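For reference, the groupBy variant I'm comparing against looks roughly like this (a sketch derived from the topN query above, with the same filtered aggregators and post-aggregation; the topN metric/threshold are replaced by a limitSpec that sorts by "score"):

```json
{
  "queryType": "groupBy",
  "dataSource": "pageviews",
  "granularity": {"type": "duration", "duration": 7200000, "origin": "2018-09-24T08:22:00.000Z"},
  "intervals": ["2018-09-24T08:22:00.000Z/PT2H"],
  "dimensions": ["onlineId"],
  "aggregations": [
    {
      "type": "filtered",
      "filter": {"type": "interval", "dimension": "__time", "intervals": ["2018-09-24T08:22:00.000Z/PT1H"]},
      "aggregator": {"type": "longSum", "name": "total_old", "fieldName": "count"}
    },
    {
      "type": "filtered",
      "filter": {"type": "interval", "dimension": "__time", "intervals": ["2018-09-24T09:22:00.000Z/PT1H"]},
      "aggregator": {"type": "longSum", "name": "total_new", "fieldName": "count"}
    }
  ],
  "postAggregations": [
    {
      "type": "arithmetic",
      "name": "score",
      "fn": "-",
      "fields": [
        {"type": "fieldAccess", "fieldName": "total_new"},
        {"type": "fieldAccess", "fieldName": "total_old"}
      ]
    }
  ],
  "limitSpec": {
    "type": "default",
    "limit": 100,
    "columns": [{"dimension": "score", "direction": "descending"}]
  }
}
```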

Thanks and Regards,

Max

If you add a filter for ONLY those IDs, do the topN and groupBy results match?

If so, you have hit upon aliasing issues related to the way the approximate topN algorithm works.
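For example, something like this added at the top level of both queries restricts them to the same IDs (the values below are just the onlineIds from your earlier results):

```json
"filter": {
  "type": "in",
  "dimension": "onlineId",
  "values": ["10264391", "10267880", "10266584", "10147655"]
}
```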

GroupBy queries seem to return the right results. The strange thing is that when part of the data is stored on the Historical nodes and the other part is on the MiddleManager (apparently in exactly the time frame where Druid merges the query results of realtime data with historical data), this kind of topN query doesn't work. (When the data is on the Historicals or the MiddleManager only, both queries work just fine.) Also, when I change the interval just a tiny bit (reducing the second period by five minutes, for instance), topN returns pretty much the same results as groupBy does.
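To see why the approximate topN can go wrong at a server boundary, here is a small, self-contained simulation (an illustrative sketch, not Druid code — real Druid ranks per segment and fetches more than the threshold): each server returns only its local top-N by the sort metric, so an ID that ranks low on one server is missing from that server's contribution, which truncates its aggregate in the merged result.

```python
# Illustrative sketch of an approximate topN merge (NOT Druid's actual code).
# Each "server" holds per-ID counts and returns only its local top-n entries;
# the broker sums whatever partial results it receives.

def local_topn(counts, n):
    """Return the n largest (id, count) pairs from one server."""
    return dict(sorted(counts.items(), key=lambda kv: -kv[1])[:n])

def merged_topn(servers, n):
    """Broker-side merge: sum the partial top-n lists from every server."""
    merged = {}
    for counts in servers:
        for key, value in local_topn(counts, n).items():
            merged[key] = merged.get(key, 0) + value
    return dict(sorted(merged.items(), key=lambda kv: -kv[1])[:n])

# ID "a" is the largest overall, but only third-ranked on server2, so its
# server2 contribution is dropped entirely when n is small.
server1 = {"a": 100, "b": 50, "c": 10}
server2 = {"b": 90, "c": 80, "a": 70}

exact = {"a": 170, "b": 140, "c": 90}          # true totals
approx = merged_topn([server1, server2], n=2)  # {"b": 140, "a": 100}

# "a" is undercounted (100 instead of 170) and no longer ranks first --
# the same shape of error as the total_old=0 rows in the topN result above.
print("exact :", exact)
print("approx:", approx)
```

With a large enough n (here, n=3) every server returns all of its entries and the merge becomes exact, which is consistent with the observation that groupBy (an exact query) returns the right results.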