quantilesDoublesSketchToHistogram query issue

Hi,

I am new to Druid and have been using the quantilesDoublesSketchToHistogram post aggregator.

I think the result it returns is incorrect, and I would appreciate some clarification.

MY QUERY :

{
  "queryType": "groupBy",
  "dataSource": "sample1",
  "granularity": "hour",
  "dimensions": [
    {"type": "default", "dimension": "appid", "outputName": "application_id"}
  ],
  "aggregations": [
    {
      "type": "quantilesDoublesSketch",
      "name": "count_appid",
      "fieldName": "appid"
    }
  ],
  "postAggregations": [
    {
      "type": "quantilesDoublesSketchToHistogram",
      "name": "histogram_count_appid",
      "field": {"type": "fieldAccess", "fieldName": "count_appid"},
      "splitPoints": [1000.0, 2000.0]
    }
  ],
  "intervals": ["2020-02-07T00:00:00.000Z/2020-02-07T23:59:00.000Z"]
}

MY RESULT :

timestamp : 2020-02-07T06:00:00.000Z

histogram_count_appid : 400

count_appid : 4

application_id : 1044534198

timestamp : 2020-02-07T06:00:00.000Z

histogram_count_appid : 200

count_appid : 2

application_id : 1057889290

and so on…

I used the following discussion as a reference:

URL : https://github.com/apache/druid/issues/6853

"Also I calculated the quantiles of [0.50, 0.75, 0.90, 0.95] and the histograms of [ 0.0, 200.0, 400.0, 600.0, 800.0, 1000.0, 1200.0, 1400.0, "Infinity" ] by myself. They were [100, 1150, 1772, 1886] and [ 6.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 3.0 ].
Compared the actual result with the query result, I found the quantile query of approximate histogram was more accurate than quantiles sketch, but for the histogram query, quantiles sketch won.

Can you tell me more about why the quantile query of approximate histogram was more accurate than quantiles sketch?"

My results differ from the above chat. Is there something I am missing to get the correct answer?

Nehar

What is the size of the dataset? This sketch uses the kmv algorithm and the default value of k is 128. That may not result in good accuracy for all dataset sizes. You could play around with the k value to get better accuracy.
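
For reference, a minimal sketch of raising k in the aggregator from the original query. The value 1024 is only an illustration, not a recommendation; as far as I know Druid expects k to be a power of 2, and larger values trade sketch size for accuracy.

```json
{
  "type": "quantilesDoublesSketch",
  "name": "count_appid",
  "fieldName": "appid",
  "k": 1024
}
```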

Vijay

"This sketch uses the kmv algorithm"

What are you talking about?

Regarding the original question, we need a better explanation: What is your use case? What is your input? What do you expect?

I tried changing the value of k. There was no change in my result values.

My use case is to query the data so that I get a histogram for a particular column. To get a histogram I used the quantilesDoublesSketchToHistogram post aggregator.

The size of the data I ingested (locally, from disk on my computer) is 3.98 MB. I did not create an ingestion spec; Druid generated it automatically.

When I ran my histogram query, I expected an output in the form of an array.

What is the distribution of those values? Could you print the sketch summary in another post agg using quantilesDoublesSketchToString?
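
A minimal sketch of what that extra post aggregator could look like when added to the postAggregations list of the original query. The output name summary_count_appid is a hypothetical name chosen for illustration; count_appid is the aggregator name from the original query.

```json
{
  "type": "quantilesDoublesSketchToString",
  "name": "summary_count_appid",
  "field": { "type": "fieldAccess", "fieldName": "count_appid" }
}
```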

Nihar,

I played around with the two sketches on the wikipedia dataset, and the results are below (I am building the sketch on the added field, with channel as the dimension).

quantiles [0.5,0.75,0.9,0.95]

quantile sketch (k=1024) [10, 43, 287, 1055]

approx hist sketch (centroid=50) [0,0,25,805]

approx hist sketch (centroid=1000) [9.41,42.74,284.59,1010.89]

So the two sets of values get close when the number of centroids is increased for the histogram sketch.
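
For anyone who wants to reproduce the quantile comparison, the relevant parts of such a query might look roughly like the fragment below. This is a sketch under assumptions, not the exact query that produced the numbers above: the aggregator and output names are made up for illustration, the approximate histogram aggregator needs the druid-histogram extension loaded, and its resolution parameter is what is referred to here as the number of centroids.

```json
{
  "aggregations": [
    { "type": "quantilesDoublesSketch", "name": "added_sketch", "fieldName": "added", "k": 1024 },
    { "type": "approxHistogram", "name": "added_hist", "fieldName": "added", "resolution": 1000 }
  ],
  "postAggregations": [
    {
      "type": "quantilesDoublesSketchToQuantiles",
      "name": "sketch_quantiles",
      "field": { "type": "fieldAccess", "fieldName": "added_sketch" },
      "fractions": [0.5, 0.75, 0.9, 0.95]
    },
    {
      "type": "quantiles",
      "name": "hist_quantiles",
      "fieldName": "added_hist",
      "probabilities": [0.5, 0.75, 0.9, 0.95]
    }
  ]
}
```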

histograms [100,200,300,400]

quantile sketch (k=1024) [8,11,6.999999]

approx hist sketch (centroid=50) [ 6.0, 10.67786693572998, 7.32395601272583 ]

approx hist sketch (centroid=1000) [ 8.0, 11.0, 7.0 ]

As you can see, the number of centroids significantly affects the results from the approx histogram sketch. However, the two sketches return almost identical values when the number of centroids and the k value are reasonably high.

Hope this helps.

Vijay