[druid-user] Percentile calculation in Druid

Hi folks,
I’m trying to calculate percentiles in druid query console. From what I understand from documentation, APPROX_QUANTILE_DS(), DS_GET_QUANTILE() etc. provided in Datasketches extension prove to be useful to reach an approximate value. But is there an way to fetch interpolated values from druid that are equivalent to standard results.
Case -

Dataset

5.71, 6.33, 9.64, 10.74, 18.39, 29.92, 57.06, 58.41, 113.27, 119.73

Percentiles

Excel/Actual Druid

6.268 6.33 10th Percentile

9.915 9.64 25th Percentile

24.155 29.92 50th Percentile

58.0725 58.41 75th Percentile

113.916 113.27 90th Percentile

Standard Percentile Calculation (50th)

  1. Find index/rank: ((50/100)*(10+1)) = 5.5

  2. If it is a whole number x, the answer is the datapoint at xth position.

  3. If it is not a whole number, do interpolation.

5.5 → integer part: 5, fractional part: 0.5

3i. Take (rank) and (rank+1) rows: 18.39 and 29.92 and multiply by the fractional part.

(29.92 - 18.39) * 0.5 = 5.765

3ii. Add it to the smaller value (18.39):

18.39 + 5.765 = 24.155

Hi Neelanshi,

The DataSketch approximation algorithm does not necessarily use all the values in the series. The calculation process is different, it is designed to provide calculation speed/efficiency, its accuracy is described here:

https://datasketches.apache.org/docs/Quantiles/OrigQuantilesSketch

Sergio

I don’t know whether this can be calculated with 100% accuracy using druid rollup and tools. However, you can increase the accuracy of the sketch by increasing logK (as described in the doc Sergio linked to.

The quantiles sketch does not do interpolation. The result of getQuantile(rank) is always one of the values the sketch has seen.