Non-deterministic query results with approximate histogram aggregator

Hi all,

I’d really appreciate any insight you could give on this. I’m seeing very strange behavior where I get vastly different results for approximate histogram queries when I run them back to back on historical (immutable) data. What’s worse is that the results differ by 4 orders of magnitude.

Does the algorithm depend on the order in which results are combined?

What could account for the wild variation? Some kind of integer overflow perhaps?

Which of the two results is likely the “correct” approximation?

# This is the query that I run twice in a row:

{
  "queryType": "timeseries",
  "dataSource": "svc_perf",
  "granularity": "all",
  "aggregations": [
    {
      "type": "approxHistogramFold",
      "name": "hist_fold",
      "fieldName": "perf_hist"
    },
    {
      "type": "longSum",
      "name": "volume",
      "fieldName": "calls"
    }
  ],
  "postAggregations": [
    {
      "type": "customBuckets",
      "name": "histogram",
      "fieldName": "hist_fold",
      "breaks": [0, 10000, 999999999999999]
    }
  ],
  "intervals": "2015-06-26T04:00:00.000Z/2015-07-03T04:00:00.000Z"
}

# Sometimes I get this result, where the counts add up to the total volume and the number of calls with latency greater than 10000 ms is about 23k.

[
  {
    "timestamp": "2015-06-26T04:00:00.000Z",
    "result": {
      "volume": 8019055848,
      "histogram": {
        "breaks": [0, 10000, 999999986991104],
        "counts": [8017324544, 23631]
      },
      "hist_fold": {
        "breaks": [-162086.015625, -999.4375, 160087.140625, 321173.71875, 482260.3125, 643346.875, 804433.4375, 965520],
        "counts": [7.156820356613025e-05, 8019054592, 1383.549072265625, 0.2833900451660156, 0.8648185729980469, 34.847755432128906, 6]
      }
    }
  }
]

Sometimes I get this answer, where the sum of the counts is low by roughly 1.5B calls and the bucket for calls with latency greater than 10000 ms is ~1B, more than four orders of magnitude larger than in the previous query result.

[
  {
    "timestamp": "2015-06-26T04:00:00.000Z",
    "result": {
      "volume": 8019055848,
      "histogram": {
        "breaks": [0, 10000, 999999986991104],
        "counts": [5494645248, 1067470144]
      },
      "hist_fold": {
        "breaks": [-162086.015625, -999.4375, 160087.140625, 321173.71875, 482260.3125, 643346.875, 804433.4375, 965520],
        "counts": [0.061051566153764725, 8019054592, 1383.549072265625, 0.2833900451660156, 0.8648185729980469, 34.847755432128906, 6]
      }
    }
  }
]

Thanks,

Roger

I forgot to mention that I’m using Druid 0.7.1.1.

Hi Roger, can you try issuing the same query with:
"context": {"useCache": false, "populateCache": false}
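For reference, the context object sits at the top level of the query, so it would look roughly like this (your query from above, unchanged except for the added "context" block):

{
  "queryType": "timeseries",
  "dataSource": "svc_perf",
  "granularity": "all",
  "context": {
    "useCache": false,
    "populateCache": false
  },
  "aggregations": [
    { "type": "approxHistogramFold", "name": "hist_fold", "fieldName": "perf_hist" },
    { "type": "longSum", "name": "volume", "fieldName": "calls" }
  ],
  "postAggregations": [
    { "type": "customBuckets", "name": "histogram", "fieldName": "hist_fold", "breaks": [0, 10000, 999999999999999] }
  ],
  "intervals": "2015-06-26T04:00:00.000Z/2015-07-03T04:00:00.000Z"
}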

I wonder if there’s some caching problem that you are hitting.

I’m asking some folks more familiar with the approximate histogram code to check out this thread.

Thanks, Fangjin.

Adding those context settings did not help. Even with those set, I get different (and wacky) results on repeat queries.

These are the results I got for 4 queries in a row, all with breaks set close to 1000ms. The numbers are wildly different.

[
  {
    "timestamp": "2015-06-26T04:00:00.000Z",
    "result": {
      "volume": 8019055848,
      "histogram": {
        "breaks": [0, 1000, 999999986991104],
        "counts": [4208447232, 2581501184]
      },
      "hist_fold": {
        "breaks": [-162086.015625, -999.4375, 160087.140625, 321173.71875, 482260.3125, 643346.875, 804433.4375, 965520],
        "counts": [0.05150444805622101, 8019054592, 1117.7796630859375, 210.0430450439453, 65.29234313964844, 35, 6]
      }
    }
  }
]

[
  {
    "timestamp": "2015-06-26T04:00:00.000Z",
    "result": {
      "volume": 8019055848,
      "histogram": {
        "breaks": [0, 1000, 999999986991104],
        "counts": [5506124800, 1283823616]
      },
      "hist_fold": {
        "breaks": [-162086.015625, -999.4375, 160087.140625, 321173.71875, 482260.3125, 643346.875, 804433.4375, 965520],
        "counts": [0.05150444805622101, 8019054592, 1389, 0, 1, 35, 6]
      }
    }
  }
]

[
  {
    "timestamp": "2015-06-26T04:00:00.000Z",
    "result": {
      "volume": 8019055848,
      "histogram": {
        "breaks": [0, 1000, 999999986991104],
        "counts": [5506124800, 1283823616]
      },
      "hist_fold": {
        "breaks": [-162086.015625, -999.4375, 160087.140625, 321173.71875, 482260.3125, 643346.875, 804433.4375, 965520],
        "counts": [0.05150444805622101, 8019054592, 1389, 0, 1, 35, 6]
      }
    }
  }
]

[
  {
    "timestamp": "2015-06-26T04:00:00.000Z",
    "result": {
      "volume": 8019055848,
      "histogram": {
        "breaks": [0, 1001, 999999986991104],
        "counts": [314776704, 3745873152]
      },
      "hist_fold": {
        "breaks": [-162086.015625, -999.4375, 160087.140625, 321173.71875, 482260.3125, 643346.875, 804433.4375, 965520],
        "counts": [0.16587284207344055, 8019054592, 1026.8448486328125, 234.38809204101562, 73.20160675048828, 34.478416442871094, 6.064126968383789]
      }
    }
  }
]

Roger, the order of merging results does matter for approximate histograms, which is why things can move around a bit.

In your case, I think what’s happening is that the buckets you are looking at cover only a handful of centroids, which leads to wildly varying results if those centroids shift even a little, since the algorithm interpolates between centroids.

I would try increasing the resolution of the aggregator to a few hundred centroids and see how that improves your results.
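For example, something along these lines (the resolution value here is just an illustrative starting point, not a recommendation for your data; the default is 50 centroids if I remember correctly):

{
  "type": "approxHistogramFold",
  "name": "hist_fold",
  "fieldName": "perf_hist",
  "resolution": 400
}

Everything else in the query stays the same.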

Thank you, Xavier. I’ll give it a try.

The highly non-deterministic behavior of the histogram aggregator should be documented, IMHO…

Agree. We’re working on it.

Hi Xavier,

Should I also increase the number of buckets? How does that setting relate to accuracy?

Thanks,

Roger

Hi Roger, the number of buckets only determines the default output when you are not using a post-aggregator.
In your case, you are already using a customBuckets post-aggregator, so the only thing that will affect accuracy is the resolution

(as described in the docs http://druid.io/docs/latest/development/approximate-histograms.html)

If you don’t care about accuracy outside a certain range of data, you might want to set a lowerLimit and an upperLimit.
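For example, if you only care about latencies between 0 and, say, 60000 ms (that upper bound is just a placeholder, pick whatever range matters for your data), the fold aggregator would look something like:

{
  "type": "approxHistogramFold",
  "name": "hist_fold",
  "fieldName": "perf_hist",
  "resolution": 400,
  "lowerLimit": 0,
  "upperLimit": 60000
}

with the customBuckets post-aggregator left as it is.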

This particular algorithm (as indicated in the paper) does not work well with highly skewed data, so depending on the distribution and which portion of it you are interested in, you may have to use buckets in the thousands to get decent estimates in the tails.

The other thing that can affect your histogram is the ordering of the data. If the data in your column is not randomly distributed, centroids can clump together instead of producing a good approximation. We also took some shortcuts when merging in the interest of speed, which will probably cause errors to be larger than expected in those cases.

Bottom line, this feature is still labeled experimental for a reason. It’s great for getting a first-order approximation of quantiles and an idea of data distributions, but it still requires a fair amount of tweaking depending on the underlying data, and there is no one-size-fits-all setting that will work out of the box.

It could also be that for your data it doesn’t work at all for some reason. Maybe you can share the underlying data and/or the segment in question? I can have a look and see if there’s anything odd going on.

Thanks!

Xavier