Wrong distinct count value

I have a Druid cluster with millions of records.

When I run the following query:

SELECT COUNT(DISTINCT user_user_id) FROM details_orc

the result I get is lower than the actual value.

When I give the same data to Presto, it returns the correct value for the same query.

What could be the reason for this?

How can I solve it?



Druid does not provide an exact count distinct operation. It uses an approximate algorithm (I believe it is HyperLogLog, but I could be mistaken). You may have noticed that it returns a decimal rather than an integer; this is why.
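
To illustrate why an approximate algorithm gives a fractional result that is close to, but not exactly, the true count, here is a minimal HyperLogLog sketch in Python. This is not Druid's implementation; the function name hll_estimate, the register-count parameter p, and the use of SHA-1 as the hash are all my own assumptions for the example.

```python
import hashlib
import math


def hll_estimate(values, p=11):
    """Estimate the number of distinct values with a toy HyperLogLog.

    p is the number of hash bits used to pick a register (m = 2**p registers).
    Illustrative only; real implementations add further bias corrections.
    """
    m = 1 << p
    registers = [0] * m
    for v in values:
        # Take 64 bits of a SHA-1 digest as the hash (an arbitrary choice here).
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = h >> (64 - p)                      # first p bits select a register
        rest = h & ((1 << (64 - p)) - 1)         # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)

    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    # Small-range correction: fall back to linear counting.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw


if __name__ == "__main__":
    # The estimate is a float near 100000, not the exact integer.
    print(hll_estimate(range(100_000)))
```

With 2**11 registers the expected relative error is roughly 2%, so the estimate can land either below or above the true count.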


Ben is right: Druid uses HyperLogLog for that query. If you need the exact distinct count, you could do a nested groupBy, where the inner query groups by user_user_id and the outer query counts the resulting rows. I think you'll need to do that as a direct Druid JSON query, though, since as far as I know the SQL interfaces don't currently support that.
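
A rough sketch of what such a nested groupBy could look like, built as a Python dict and printed as JSON. The helper name build_exact_distinct_query, the output names "rows" and "distinct_users", and the wide interval are placeholders I made up; adjust the interval to the time range your data actually covers.

```python
import json


def build_exact_distinct_query(datasource, column, intervals):
    """Build a nested Druid groupBy that counts distinct values of `column`."""
    # Inner query: one result row per distinct value of the column.
    inner = {
        "queryType": "groupBy",
        "dataSource": datasource,
        "granularity": "all",
        "dimensions": [column],
        "aggregations": [{"type": "count", "name": "rows"}],
        "intervals": intervals,
    }
    # Outer query: count the rows produced by the inner query.
    outer = {
        "queryType": "groupBy",
        "dataSource": {"type": "query", "query": inner},
        "granularity": "all",
        "dimensions": [],
        "aggregations": [{"type": "count", "name": "distinct_users"}],
        "intervals": intervals,
    }
    return outer


# Placeholder interval (an assumption), chosen wide enough to cover everything.
query = build_exact_distinct_query(
    "details_orc", "user_user_id", ["1000-01-01/3000-01-01"]
)

if __name__ == "__main__":
    print(json.dumps(query, indent=2))
```

You would POST the printed JSON to the broker's /druid/v2 endpoint with Content-Type: application/json. Keep in mind that the inner query materializes one row per distinct user, so this can be expensive on a high-cardinality column.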