If exact aggregates are required, why does topN need two queries?

I read the document at http://druid.io/docs/0.9.2/querying/topnquery.html

It says:

“Users who can tolerate approximate rank topN over a dimension with greater than 1000 unique values, but require exact aggregates can issue two queries. One to get the approximate topN dimension values, and another topN with dimension selection filters which only use the topN results of the first.”

Why can’t I get the result in the first query?

Example first query:

{
    "aggregations": [
        {
            "fieldName": "L_QUANTITY_longSum",
            "name": "L_QUANTITY_",
            "type": "longSum"
        }
    ],
    "dataSource": "tpch_year",
    "dimension": "l_orderkey",
    "granularity": "all",
    "intervals": [
        "1900-01-09T00:00:00.000Z/2992-01-10T00:00:00.000Z"
    ],
    "metric": "L_QUANTITY_",
    "queryType": "topN",
    "threshold": 2
}

Example second query:

{
    "aggregations": [
        {
            "fieldName": "L_TAX_doubleSum",
            "name": "L_TAX_",
            "type": "doubleSum"
        },
        {
            "fieldName": "L_DISCOUNT_doubleSum",
            "name": "L_DISCOUNT_",
            "type": "doubleSum"
        },
        {
            "fieldName": "L_EXTENDEDPRICE_doubleSum",
            "name": "L_EXTENDEDPRICE_",
            "type": "doubleSum"
        },
        {
            "fieldName": "L_QUANTITY_longSum",
            "name": "L_QUANTITY_",
            "type": "longSum"
        },
        {
            "name": "count",
            "type": "count"
        }
    ],
    "dataSource": "tpch_year",
    "dimension": "l_orderkey",
    "filter": {
        "fields": [
            {
                "dimension": "l_orderkey",
                "type": "selector",
                "value": "103136"
            },
            {
                "dimension": "l_orderkey",
                "type": "selector",
                "value": "1648672"
            }
        ],
        "type": "or"
    },
    "granularity": "all",
    "intervals": [
        "1900-01-09T00:00:00.000Z/2992-01-10T00:00:00.000Z"
    ],
    "metric": "L_QUANTITY_",
    "queryType": "topN",
    "threshold": 2
}

Quite simply, because each historical does a local topN (top 1000 with the default configuration) before passing its results to the broker, which computes the final topN from those top-1000 lists. So if your data has a pretty good distribution and you want to compare the top values, the overall top values will usually be the top values in every subset as well. BUT if your data is not very distributed (a case where topN barely makes sense anyway), AND the dimension has more than 1000 unique values, then the result suffers from sample aliasing: you are truncating the long tail of each sub-sample before you merge them.

Imagine for a moment a dataSource that is a list of active US senators and the US state each represents, split between 2 historical nodes. Now imagine each historical node can only hold 10 entries in the local result it sends back up.

If you ask for the “top 5 US states ranked by number of US senators”, each node will rank the results it has locally, but can only return 10 of them. Locally it will see somewhere between 0 and 2 senators per state, yet it only gets to return 10 rows. The states it happens to have 2 of will go first, then any with 1 in some order, then the ones with 0 in some order. It takes that “top” 10 and kicks it up a level in the query hierarchy for further processing.

So now, at the broker level, we see two sets of 10 results each. We can merge them, but at this point some states will be known to have 2 senators, some only 1, and some won’t be represented at all! Thus the ordering is approximate, and the results for the final top 5 we asked for are approximate.
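To make that concrete, here is a minimal sketch in plain Python (no Druid involved; the state names, shard split, and per-node limits are made up for illustration) that simulates the scenario: each “historical” aggregates locally and truncates to its top 10, then the “broker” merges the truncated lists.

from collections import Counter
import random

random.seed(42)

# One row per senator: 50 states x 2 senators = 100 rows, shuffled and
# split between two "historicals".
rows = [("state_%02d" % i, 1) for i in range(50) for _ in range(2)]
random.shuffle(rows)
shards = [rows[::2], rows[1::2]]

def local_topn(shard, limit):
    """Aggregate locally, then truncate to `limit` entries (the local topN)."""
    counts = Counter()
    for state, n in shard:
        counts[state] += n
    return counts.most_common(limit)

def broker_topn(shards, limit, threshold):
    """Merge the truncated per-node lists and take the final topN."""
    merged = Counter()
    for shard in shards:
        for state, n in local_topn(shard, limit):
            merged[state] += n
    return merged.most_common(threshold)

# limit=10: each node truncates its long tail, so some states reach the
# broker with only one (or neither) of their two senators counted.
print(broker_topn(shards, limit=10, threshold=5))

# limit=50 (>= the number of states): nothing is truncated, every state
# arrives with its full count of 2, and the result is exact.
print(broker_topn(shards, limit=50, threshold=5))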

Now, let’s say we can return 50 (or more) results from each of our two historicals. NOW we know a full result set is coming back from each historical and can guarantee every state is completely represented. So the results are in the correct “order” and the aggregates are completely accurate.

Now let’s say we take the original result of 5 and do a SECOND query filtering on ONLY those 5 values. We will still get an approximate ordering, but the numerical values of our metrics are guaranteed to be accurate, since the sub-set on each historical (at most 5 values, far below the local limit) completely contains all the results we are asking for.
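Here is a rough sketch of chaining the two queries, assuming a broker reachable at a hypothetical http://localhost:8082 (queries are POSTed as JSON to Druid’s standard /druid/v2 endpoint; the extra aggregators from the second example above are omitted to keep it short):

import requests

BROKER_URL = "http://localhost:8082/druid/v2"  # assumed broker address

def run_topn(query):
    resp = requests.post(BROKER_URL, json=query)
    resp.raise_for_status()
    # With granularity "all", topN returns a single time bucket.
    return resp.json()[0]["result"]

# First query: approximate topN, used only to pick the dimension values.
first = {
    "queryType": "topN",
    "dataSource": "tpch_year",
    "dimension": "l_orderkey",
    "metric": "L_QUANTITY_",
    "threshold": 2,
    "granularity": "all",
    "intervals": ["1900-01-09T00:00:00.000Z/2992-01-10T00:00:00.000Z"],
    "aggregations": [
        {"type": "longSum", "fieldName": "L_QUANTITY_longSum", "name": "L_QUANTITY_"}
    ],
}
top_keys = [row["l_orderkey"] for row in run_topn(first)]

# Second query: same shape, but filtered down to just those values, so each
# historical aggregates all of their rows and the metrics come back exact.
second = dict(first)
second["filter"] = {
    "type": "or",
    "fields": [
        {"type": "selector", "dimension": "l_orderkey", "value": key}
        for key in top_keys
    ],
}
print(run_topn(second))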

Hopefully that helps a little.