I just filed a PR with the failing test right now: <https://github.com/druid-io/druid/pull/4744>.
I used the sample data set from <https://metamarkets.com/2013/histograms/> as it demonstrated the problem of quantiles returning max. I’ve been comparing the implementation of getQuantiles() with the BH/TT paper as well as looking into the histogram construction itself. Using the data set above:
Original dataset: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 12, 12, 15, 20, 25, 25, 25
ApproximateHistogram{size=20, lowerLimit=-Infinity, upperLimit=Infinity, positions=[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 12.0, 15.0, 20.0, 25.0], bins=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 3], binCount=14, min=1.0, max=25.0, count=18}
Histogram{breaks=[-0.26315784454345703, 1.0, 2.263157844543457, 3.526315689086914, 4.789473533630371, 6.052631378173828, 7.315789222717285, 8.578947067260742, 9.8421049118042, 11.105262756347656, 12.368420600891113, 13.63157844543457, 14.894736289978027, 16.157894134521484, 17.421051025390625, 18.684207916259766, 19.947364807128906, 21.210521697998047, 22.473678588867188, 23.736835479736328, 25.0], counts=[1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 3.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 3.0]}
And walking through the quantiles (q=0.1 through 0.95):
0.1 1.8
0.2 3.6
0.3 5.3999999999999995
0.4 7.2
0.5 9.0
0.6 11.049390153191919
0.7 12.376388603226825
0.8 17.0
0.9 23.520797289396146
0.95 25.164854858377943
I’ve been trying to figure out if this is a case where the algorithm for dealing with the trapezoid falls under the special case as outlined in the paper (section 2.1, paragraph 4):
The
case where b < p1 or b > pB requires special treatment. One possibility is to add two dummy bins
(p0,0) and (pB+1,0), where p0 and pB+1 are chosen using prior knowledge, according to which all or almost all the points in S are in the interval [p0, pB+1] (p0 and pB+1 can be determined on the fly
during the histogram’s construction).
Though I’m not sure how adding the dummy bins might help since their count is 0, and that wouldn’t do much in the summation.
I’ve got a different test case where the numbers are worse even with q=0.25 and 0.50, but I’ve yet to determine whether the histogram breaks is at fault there. The dataset above in the first example is small enough to deal in exposing this edge case.
In cases like this, I’m not sure if bins[i-1] becomes the min (floor) and bins[i] becomes the max (ceiling), such that the returned value for the quantile becomes the min or max if the calculated value is outside that predicted range.