approxHistogramFold returning quantile values greater than max?

Hi all,

I recently ran into an issue where the quantile values returned by approxHistogramFold seemed off. Specifically, they were greater than what max returns (whether using doubleMax on the raw field or max on the approxHistogram field).

For example:

max: 97.6

p95: 121.7

As an experiment, I’ve tried:

  • increasing the resolution (from the default to 200, 2,000, and 10,000)

  • setting a lowerLimit and upperLimit range (0.0, 100.0)

  • making sure the source data is capped between 0.0 and 100.0, with at most one decimal place

I checked the mailing list and GitHub for similar issues and did find something related.

I’m looking at the approx histogram fold code right now to see if I can get more information, but would appreciate any pointers.

I’m using imply 2.3.1 / druid 0.10.1 (though this also happens in druid 0.10.0).


  • gino

I’m not too familiar with the approx histogram code, but, that behavior does seem weird. Could you write a test case that reproduces this and submit the (failing) test case as a PR with an @Ignore annotation? Or, even better, a fix – but even a failing test would be helpful.

It might be due to some issue with how the quantiles are computed. I think the min/max are computed in a really straightforward way so they are probably ok.

Yes, I’m reviewing the getQuantiles() portion of the code right now along with the existing test case.

I can write this up and document my findings.


Awesome, thanks! It might also be worth checking whether the “toHistogram” methods return something that makes sense, which would give more indication of where the problem might lie, in addition to inspecting the centroids themselves stored in the positions and bins arrays.

I just filed a PR with the failing test: <>.

I used the sample dataset from <>, as it demonstrates the problem of quantiles exceeding max. I’ve been comparing the implementation of getQuantiles() with the BH/TT paper, as well as looking into the histogram construction itself. Using that dataset:

Original dataset: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 12, 12, 15, 20, 25, 25, 25

ApproximateHistogram{size=20, lowerLimit=-Infinity, upperLimit=Infinity, positions=[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 12.0, 15.0, 20.0, 25.0], bins=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 3], binCount=14, min=1.0, max=25.0, count=18}

Histogram{breaks=[-0.26315784454345703, 1.0, 2.263157844543457, 3.526315689086914, 4.789473533630371, 6.052631378173828, 7.315789222717285, 8.578947067260742, 9.8421049118042, 11.105262756347656, 12.368420600891113, 13.63157844543457, 14.894736289978027, 16.157894134521484, 17.421051025390625, 18.684207916259766, 19.947364807128906, 21.210521697998047, 22.473678588867188, 23.736835479736328, 25.0], counts=[1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 3.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 3.0]}
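As a sanity check on the centroid summary: the dataset has only 14 distinct values, and with size=20 the BH/TT algorithm never needs to merge centroids, so the positions and bins arrays should simply be the distinct values and their multiplicities. A quick Python sketch (not Druid’s Java code) to confirm:

```python
from collections import Counter

# The dataset from the failing test case.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 12, 12, 15, 20, 25, 25, 25]

# With resolution (size=20) at least the number of distinct values (14),
# no centroid merges ever happen: each centroid sits exactly on a
# distinct value, and its bin weight is that value's multiplicity.
counts = Counter(data)
positions = sorted(counts)
bins = [counts[p] for p in positions]

print(positions)  # matches the positions array in the dump above
print(bins)       # matches the bins array in the dump above
```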

And walking through the quantiles (q=0.1 through 0.95):

q     value
0.1   1.8
0.2   3.6
0.3   5.3999999999999995
0.4   7.2
0.5   9.0
0.6   11.049390153191919
0.7   12.376388603226825
0.8   17.0
0.9   23.520797289396146
0.95  25.164854858377943
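To see where that kind of overshoot can come from, here is a rough Python sketch of the paper’s trapezoid interpolation over the centroid summary above. This is my own simplification for illustration, not Druid’s getQuantiles(), so the intermediate numbers differ; but the key point holds: the cumulative count at the last centroid pB is only count − mB/2, so a large enough q targets a rank past pB, and naively solving the last trapezoid’s quadratic extrapolates beyond max.

```python
import math

# Centroid summary from the dump above.
positions = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 20, 25]
bins = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 3]
count = sum(bins)  # 18

def quantile(q):
    # Under the BH/TT "uniform" view, the cumulative count at centroid
    # p_i is sum(bins[:i]) + bins[i]/2 (half of each bin sits to the
    # left of its centroid).
    s = q * count
    cum = [sum(bins[:i]) + bins[i] / 2 for i in range(len(bins))]
    # Find the segment [p_i, p_{i+1}] containing rank s. If s falls past
    # the last centroid (possible, since cum[-1] = count - bins[-1]/2),
    # stay on the last segment and extrapolate: this is the step that
    # can overshoot max. (Ranks below the first centroid are not
    # handled in this sketch.)
    i = len(cum) - 2
    for j in range(len(cum) - 1):
        if cum[j] <= s <= cum[j + 1]:
            i = j
            break
    d = s - cum[i]
    mi, mj = bins[i], bins[i + 1]
    # Trapezoid interpolation: d = mi*t + (mj - mi)*t^2/2, where
    # t = (u - p_i) / (p_{i+1} - p_i); solve the quadratic for t.
    if mj != mi:
        t = (-mi + math.sqrt(mi * mi + 2 * (mj - mi) * d)) / (mj - mi)
    else:
        t = d / mi
    return positions[i] + t * (positions[i + 1] - positions[i])

print(quantile(0.95))  # lands above max=25, like the p95 above
```

With this data the cumulative count at the last centroid is 16.5 while q=0.95 targets rank 17.1, so the interpolation runs off the end of the last trapezoid and returns a value above 25.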

I’ve been trying to figure out if this is a case where the algorithm for dealing with the trapezoid falls under the special case as outlined in the paper (section 2.1, paragraph 4):

    The case where b < p1 or b > pB requires special treatment. One possibility is to add two dummy bins (p0, 0) and (pB+1, 0), where p0 and pB+1 are chosen using prior knowledge, according to which all or almost all the points in S are in the interval [p0, pB+1] (p0 and pB+1 can be determined on the fly during the histogram’s construction).

Though I’m not sure how adding the dummy bins might help since their count is 0, and that wouldn’t do much in the summation.
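To check what a dummy bin would actually contribute, here is the arithmetic for the tail trapezoid (the upper bound 26.0 is a made-up value, just for illustration):

```python
# Hypothetical tail: last real centroid p_B = 25 with weight m_B = 3,
# plus a dummy bin at an assumed upper bound p_B1 = 26 with weight 0.
p_B, m_B = 25.0, 3.0
p_B1, m_B1 = 26.0, 0.0

# The trapezoid between the last centroid and the dummy bin has area
# (m_B + 0)/2 despite the zero count, covering the half-bin of mass
# that otherwise lies past p_B, and it gives the interpolation a finite
# endpoint at p_B1 instead of an open-ended extrapolation.
tail_area = (m_B + m_B1) / 2
print(tail_area)  # 1.5
```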

I’ve got a different test case where the numbers are worse even at q=0.25 and 0.50, but I’ve yet to determine whether the histogram breaks are at fault there. The first dataset above is small enough to be useful in exposing this edge case.

In cases like this, I’m not sure whether bins[i-1] should act as the min (floor) and bins[i] as the max (ceiling), such that the returned quantile is clamped to the min or max when the calculated value falls outside that predicted range.

  • gino

I’ve updated the PR. After re-reading the section on the uniform procedure in the BH/TT paper, using the calculated max value when the computed uj exceeds it does seem to make sense: the uniform procedure (which getQuantiles is based on) is expected to return a set of numbers u1 < … < uB̃, where uB̃ in this case is the value at the (quantile * num_samples) position.

Applying that tiny change seems to satisfy a variety of synthetic data sets I’ve generated, including the ones in the unit test and the dataset provided by another person who filed <>. With the fix, getQuantiles() returns 189.12 (original 1167.81, expected 190).
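In effect, the change amounts to clamping the interpolated value to the histogram’s recorded bounds. A minimal Python sketch of the idea (the actual patch is in the Java getQuantiles(), so the function and argument names here are made up):

```python
def clamp_quantile(u, hist_min, hist_max):
    # If the trapezoid interpolation lands outside the [min, max] that
    # the histogram recorded, fall back to the recorded bound, in line
    # with the uniform procedure's expectation that u_1 < ... < u_B
    # stay within the observed range.
    return min(max(u, hist_min), hist_max)

# The overshooting p95 from the first example:
print(clamp_quantile(25.164854858377943, 1.0, 25.0))  # → 25.0
```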

  • gino

That sounds great, thank you for looking into it! I’ll take a look at your patch as soon as I can.