I have a cardinality aggregator in the aggregations field of my Druid query, as follows:
According to the Druid docs:
The cardinality of a single dimension is something like:
SELECT COUNT(DISTINCT(dimension)) FROM <datasource>
Based on that, I expect to get the number of aggregated rows (as an integer, of course),
but I am getting a decimal value.
How can I make sure I get the exact result, which is 60 in my case?
Is it because of some approximation Druid does?
Your guess is right, the “cardinality” aggregator is approximate and the decimal value is an effect of that. If you need an exact count, you could use a nested groupBy (outer groupBy does a “count”, inner groupBy groups by the thing you want to count distinct of). This will be slower and more resource intensive though.
Thanks for the confirmation and suggestion Gian.
Is there any place where I can learn how cardinality estimation works in Druid?
However, using groupBy is not possible in my case because it is too resource intensive, so that's out of the picture.
Is there any way I can handle it on the client side? For example, if the value coming from Druid is between 63.00 and 63.99, can I safely say the exact count would have been 64? Or is there any similar rule based on the decimal value Druid returns?
You can read about the algorithm here:
In general you can expect the error to be on average 3% or so. So, rounding it to the nearest integer will not give exact results.
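To make the rounding point concrete, here is a rough Python sketch. It assumes the ~3% figure can be treated as a hard bound, which it is not (it is only an average), and the function name is made up for illustration. It lists every integer count that could have produced a given estimate under that assumption:

```python
import math

def plausible_true_counts(estimate, rel_error=0.03):
    """Integers n for which `estimate` lies within n * (1 +/- rel_error)."""
    # Assumption: rel_error is treated as a hard bound. The ~3% quoted in the
    # thread is only an average, so the real range can be wider than this.
    low = math.ceil(estimate / (1 + rel_error))
    high = math.floor(estimate / (1 - rel_error))
    return list(range(low, high + 1))

candidates = plausible_true_counts(63.4)
print(candidates)
```

Even under this optimistic assumption, an estimate of 63.4 is compatible with several different exact counts, so no client-side rule on the decimal part can recover the true value.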
Gian, is the error lower for smaller sets of data? We’d like to display values accurate to the whole number for ‘small’ values (say < 1000) and are willing to live with that error for larger values. Is there some way to achieve this? Is it possible to tune the HyperLogLog algorithm (say the number of bits or something) to get higher accuracy?
For smaller sets, the relative error should be about the same, but the absolute error is lower. You’ll still get a decimal but you can round it off if you want (although the rounded number isn’t guaranteed to be accurate). The HLL algorithm in general is tunable but Druid’s implementation is not.
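As a quick illustration of relative versus absolute error, the sketch below applies the same ~3% figure to a few set sizes. The 3% value is the rough average mentioned earlier in the thread, used here as an assumption, and the helper name is hypothetical:

```python
def expected_absolute_error(true_count, rel_error=0.03):
    # Assumption: the same ~3% relative error applies at every set size.
    return true_count * rel_error

for n in (10, 100, 1_000, 1_000_000):
    err = expected_absolute_error(n)
    print(f"{n:>9} distinct values -> off by about {err:,.1f}")
```

For a set of around 100 values the estimate is typically off by a few units, which is small in absolute terms but still enough to make a rounded result wrong.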
If you know the distinct set will be small, you could ask Druid for an exact count, and it usually doesn’t take that long. Exact distinct counts are only really resource intensive if the set is large.
You could also explore datasketches (http://druid.io/docs/latest/development/extensions-core/datasketches-aggregators.html) which have somewhat different behavior than HLL especially at low set sizes. I believe they can guarantee exactness below certain thresholds, but I’m not sure of the details. Maybe an expert in datasketches could chime in.
Thanks Gian. When you say - ask Druid for an exact count, are you referring to a groupBy query on the dimension we want to count? Or are you referring to http://druid.io/docs/latest/development/extensions-contrib/distinctcount.html - does this give me the exact count of distinct values?
I was referring to a groupBy on the dimension you want to count. It would be an outer query with a “count” aggregator and an inner query that groups on the column you’re counting.
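For reference, a nested groupBy along those lines might look roughly like the sketch below, written in Python so the JSON body is easy to build and inspect. The datasource name, dimension name, interval, and output name are placeholders rather than values from this thread, and the exact fields should be checked against the groupBy docs for your Druid version:

```python
import json

def build_exact_distinct_query(datasource, dimension, interval):
    """Build a nested groupBy that counts the distinct values of one dimension."""
    # Inner query: one result row per distinct value of the dimension.
    inner = {
        "queryType": "groupBy",
        "dataSource": datasource,
        "granularity": "all",
        "dimensions": [dimension],
        "aggregations": [],
        "intervals": [interval],
    }
    # Outer query: counts the rows produced by the inner query.
    return {
        "queryType": "groupBy",
        "dataSource": {"type": "query", "query": inner},
        "granularity": "all",
        "dimensions": [],
        "aggregations": [{"type": "count", "name": "exact_distinct"}],
        "intervals": [interval],
    }

# Placeholder names, purely for illustration.
query = build_exact_distinct_query(
    "my_datasource", "my_dimension", "2016-01-01/2016-02-01"
)
print(json.dumps(query, indent=2))
```

The printed JSON is what you would POST to the broker. The inner query does the expensive part, so the cost grows with the number of distinct values rather than with the outer count.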