"Inverse" quantile post-aggregations


With the approximated histograms we can obtain approximated quantiles in the post-aggregation phase,

  • Is it possible to obtain the associated quantile to a given value?
  • If not (I imagine that’s the case), is it possible to contribute code to add this feature using the same underlying data structures? (My intuition says me that this should be possible without the need of other data structures, but I’m not entirely sure, I think that in the worst case a binary search could be used over the histogram data structure to obtain the related percentile to a given number).
    I’m asking about contributing and I haven’t did it yet, but It’s my aim to do it (mainly on the aggregations layer) in a few weeks.

Thank you for your time and attention.

Hi acorrea,
I dont think you can do inverse quantile with the current code, but this should not be too difficult to add.

Have a look at get getQuantiles method in ApproximateHistogram


you should be able to add another method like float getProbablity(float value) by using the underlying data structures.

I think this would be useful for the community and looking forward for the contribution.

Let me know if you face any issues or blockers.

One complication is rollup

If things were not rolled up you could have an aggregate that simply counted the number of results which were more vs less and report the result.

BUT with rollup that becomes much more complicated.

It might be worth talking with https://github.com/jon-wei (ping him in a github issue) because aggregating on numeric dimensions is one way to solve this.

Other than that I am not familiar with any algorithms for approximating quantile given the value and allowing rollup, but am by no means an expert in that area. Will be interesting to see what you come up with.

Also it would be worth filing a github issue before you start developing code.

Hi Nishant,

thank you for your response!

about the method name, I’m not sure about choosing “getProbability”, because the “correct” term should be something related to “cumulative relative frequency”. The problem with the “correct” term is that’s very long, probably too long.