TopNMetricSpec - Performance Clarification

Hi All,

Problem statement:

We need to find all the distinct userIDs in a huge dataset (~100 million), so we are experimenting with TopNMetricSpec to fetch the userIDs in batches sized by the configured threshold.

Can anyone help me understand how TopNMetricSpec runs?

If I run the following TopN query with TopNMetricSpec repeatedly, n times, using the same HTTP client, will it scan all the records every time once previousStop is set?

Consider the following Data:

┌──────────────────────────┬─────────┬────────┬────────┐
│ __time                   │ movieId │ rating │ userId │
├──────────────────────────┼─────────┼────────┼────────┤
│ 2015-02-05T00:10:09.000Z │ 2011    │ 3.5    │ 215    │
│ 2015-02-05T00:10:26.000Z │ 38061   │ 3.5    │ 215    │
│ 2015-02-05T00:10:32.000Z │ 8981    │ 2.0    │ 215    │
│ 2015-02-05T00:11:00.000Z │ 89864   │ 4.0    │ 215    │
│ 2015-02-23T23:55:08.000Z │ 56587   │ 1.5    │ 31     │
│ 2015-02-23T23:55:33.000Z │ 51077   │ 4.0    │ 31     │
│ 2015-02-23T23:55:35.000Z │ 49274   │ 4.0    │ 31     │
│ 2015-02-23T23:55:37.000Z │ 30816   │ 2.0    │ 31     │
│ 2015-03-19T14:24:01.000Z │ 5066    │ 5.0    │ 176    │
│ 2015-03-19T14:26:23.000Z │ 6776    │ 5.0    │ 176    │
│ 2015-03-29T16:19:58.000Z │ 2337    │ 2.0    │ 96     │
└──────────────────────────┴─────────┴────────┴────────┘

For example, with the query below:

  1. Initially, I set previousStop to null and threshold to 2, so it fetches the first two userIds (because threshold = 2), viz. 215 and 176.

  2. Next, I pass previousStop = 176. The question is: will the broker scan all the records again, or will it resume scanning from where it stopped after step 1, i.e. 176?

{
  "queryType": "topN",
  "dataSource": "ratings30K",
  "intervals": "2015-02-05T00:00:00.000Z/2015-03-30T00:00:00.000Z",
  "granularity": "all",
  "dimension": "userId",
  "threshold": 2,
  "metric": {
    "type": "inverted",
    "metric": {
      "type": "dimension",
      "ordering": "Numeric",
      "previousStop": null
    }
  }
}
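The paging pattern from steps 1 and 2 can be sketched roughly as follows. This is only an illustration under assumptions: the function names (build_topn_query, fake_run_query, page_distinct_user_ids) are hypothetical, the response shape (a list with one entry whose "result" holds the rows) is assumed, and fake_run_query is an in-memory stand-in for the HTTP call that only emulates descending numeric order over the sample userIds. It says nothing about what the broker actually scans on each call.

```python
def build_topn_query(previous_stop=None, threshold=2):
    """Build the topN query from the post, varying only previousStop."""
    return {
        "queryType": "topN",
        "dataSource": "ratings30K",
        "intervals": "2015-02-05T00:00:00.000Z/2015-03-30T00:00:00.000Z",
        "granularity": "all",
        "dimension": "userId",
        "threshold": threshold,
        "metric": {
            "type": "inverted",
            "metric": {
                "type": "dimension",
                "ordering": "Numeric",
                "previousStop": previous_stop,
            },
        },
    }


def fake_run_query(query):
    """Hypothetical stand-in for the HTTP call (assumption, not Druid itself).

    Emulates descending numeric order over the distinct sample userIds and
    returns only values that come after previousStop in that order.
    """
    user_ids = sorted({215, 31, 176, 96}, reverse=True)
    stop = query["metric"]["metric"]["previousStop"]
    if stop is not None:
        user_ids = [u for u in user_ids if u < int(stop)]
    page = user_ids[: query["threshold"]]
    # Assumed response shape: one entry whose "result" lists the rows.
    return [{"result": [{"userId": str(u)} for u in page]}]


def page_distinct_user_ids(run_query, threshold=2):
    """Call the query repeatedly, feeding the last userId back as previousStop."""
    collected = []
    previous_stop = None
    while True:
        response = run_query(build_topn_query(previous_stop, threshold))
        rows = response[0]["result"] if response else []
        if not rows:
            break  # an empty page means there is nothing left to fetch
        collected.extend(row["userId"] for row in rows)
        previous_stop = rows[-1]["userId"]
    return collected


all_user_ids = page_distinct_user_ids(fake_run_query)
```

In real use, run_query would be replaced by a function that POSTs the query body to the broker with the shared HTTP client; the loop itself stays the same.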

Thank you,

Niraj Dedhia