hyperUnique and cardinality scientific notation

We are trying to get a distinct count of users using hyperUnique and cardinality. Whenever the result exceeds 10M, the query returns it in scientific notation, e.g.

"distinct_user_id" : 1.0584265650404368E7

Is there a setting that will show the full numeric value?



If you need that many significant figures in a hyperUnique result, I can assure you that your application is wrong. hyperUnique is a HyperLogLog estimate, so the trailing digits carry no real information. Please read up on HyperLogLog and its limitations.

http://druid.io/blog/2012/05/04/fast-cheap-and-98-right-cardinality-estimation-for-big-data.html is a good start.

Thanks for the reply Charles.

We are trying to get a count of distinct users that fall within certain criteria (e.g. male / age 25-30 / does something ABC / lives in XYZ, etc.). What would be the best approach to get large distinct counts (10M+ users)?

Is this event data or audience data? If it is event data with the audience attributes tacked onto each event, then a hyperUnique estimate will probably be your best bet for the time being. If your application requires 100% accuracy, there are other considerations to take into account (like making certain you only query data from batch processing). If it is audience data and you have a unique ID per audience member, you can simply issue a count aggregation in a query with the filter you described.
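For the audience-data case, the count-with-filter query could look roughly like the sketch below. The datasource name, dimension names, and interval are made up for illustration and are not from this thread. It also assumes exactly one row per audience member, since a count aggregation counts rows rather than distinct IDs.

```python
import json


def build_count_query(data_source, interval, criteria):
    """Build a timeseries query that counts rows matching every criterion.

    `criteria` maps a dimension name to the single value it must equal.
    """
    fields = [
        {"type": "selector", "dimension": dim, "value": value}
        for dim, value in criteria.items()
    ]
    return {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": "all",
        "intervals": [interval],
        # All criteria must hold at once, so combine them with "and".
        "filter": {"type": "and", "fields": fields},
        "aggregations": [{"type": "count", "name": "matching_users"}],
    }


# Hypothetical datasource and dimensions, for illustration only.
query = build_count_query(
    "audience",
    "2015-01-01/2016-01-01",
    {"gender": "male", "age_range": "25-30", "region": "XYZ"},
)
print(json.dumps(query, indent=2))
```

The printed JSON is what you would POST to the broker. If the datasource is rolled up, or holds more than one row per member, this count will not equal the number of distinct users.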

Hey Will,

If you are OK with a modest error rate (a few %) then hyperUnique is the way to go. You can convert the number you get to a long after reading it from Druid; just be aware that it is not an exact calculation.
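As a minimal sketch of that client-side conversion, assuming the response has already been read as JSON text: the scientific notation is only how the double is serialized, so any JSON parser reads it back as an ordinary number that you can round. The helper name `estimate_as_int` is mine, not part of any Druid client.

```python
import json

# The fragment from the original question, wrapped as a JSON object.
raw = '{"distinct_user_id" : 1.0584265650404368E7}'


def estimate_as_int(response_text, field):
    """Parse a JSON object and round the named estimate to a whole number."""
    value = json.loads(response_text)[field]  # parsed as a float
    return int(round(value))


print(estimate_as_int(raw, "distinct_user_id"))  # → 10584266
```

The rounded value is still an estimate. It just drops the fractional part that the estimator happens to produce.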

If you absolutely need an exact calculation, it will be quite a bit slower. You could do a nested groupBy query, but at 10M you will run into groupBy result set limits. You could bump those up a lot, and bump up the JVM heap accordingly, and that should work. Another, possibly less precarious, approach is to pull the list of users using a series of lexicographic topNs (http://druid.io/docs/latest/querying/topnmetricspec.html) and count them on your end.
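The paging idea with lexicographic topNs could be sketched like this. Each query asks for the next `threshold` values after `previousStop`, and the client keeps a running count. The `run_query` callable is a hypothetical stand-in for however you POST a query to the broker and parse the JSON response, and the datasource and dimension names are made up. I am also assuming that a null `previousStop` means "start from the beginning", so check that against the metric spec docs for your version.

```python
def build_topn_page(data_source, interval, dimension, page_size, previous_stop):
    """Build one topN query that returns the next page of dimension values."""
    return {
        "queryType": "topN",
        "dataSource": data_source,
        "granularity": "all",
        "intervals": [interval],
        "dimension": dimension,
        "threshold": page_size,
        "metric": {"type": "lexicographic", "previousStop": previous_stop},
        # The aggregation is just a placeholder for this sketch; only the
        # dimension values are used by the client.
        "aggregations": [{"type": "count", "name": "rows"}],
    }


def count_distinct(run_query, data_source, interval, dimension, page_size=1000):
    """Count distinct non-null dimension values by paging through topNs."""
    total = 0
    previous_stop = None
    while True:
        query = build_topn_page(
            data_source, interval, dimension, page_size, previous_stop
        )
        response = run_query(query)
        rows = response[0]["result"] if response else []
        values = [r[dimension] for r in rows if r.get(dimension) is not None]
        # Skip the stop value itself in case the server returns it again.
        new_values = [v for v in values if v != previous_stop]
        if not new_values:
            return total
        total += len(new_values)
        previous_stop = new_values[-1]


# In-memory stand-in for a broker, only so the sketch runs on its own.
_ids = sorted("user%04d" % i for i in range(2500))


def fake_run_query(query):
    stop = query["metric"]["previousStop"]
    dim = query["dimension"]
    remaining = [i for i in _ids if stop is None or i > stop]
    page = remaining[: query["threshold"]]
    return [{"timestamp": "2015-01-01T00:00:00.000Z",
             "result": [{dim: i, "rows": 1} for i in page]}]


print(count_distinct(fake_run_query, "events", "2015-01-01/2016-01-01", "user_id"))
```

To count only users matching your criteria, add a "filter" to each page query. At 10M+ users this is a lot of round trips, so a larger page size is probably worth trying.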

Thanks for the replies.

We are only looking for an approximation at this point, so hyperUnique works for us. We will try the approach of converting it to a long after getting the result from Druid.


I will also add that DataSketches can provide exact results below a configured threshold and approximate results beyond it. You can also do set intersections with theta sketches, which makes them more powerful than HyperLogLog. Unfortunately, theta sketches also require much more storage than hyperUnique.
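To give a rough idea of what a theta sketch intersection looks like, here is a sketch of a query that builds one sketch per criterion with filtered aggregators and intersects them in a post-aggregator. It assumes the datasketches extension is loaded and that the datasource has a theta sketch column. The datasource, column, and dimension names are made up for illustration.

```python
import json


def build_theta_intersection_query(data_source, interval, sketch_field,
                                   filter_a, filter_b):
    """Build a query estimating how many users match both filters."""

    def filtered_sketch(name, flt):
        # One theta sketch per criterion, restricted by its own filter.
        return {
            "type": "filtered",
            "filter": flt,
            "aggregator": {
                "type": "thetaSketch",
                "name": name,
                "fieldName": sketch_field,
            },
        }

    return {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": "all",
        "intervals": [interval],
        "aggregations": [
            filtered_sketch("users_a", filter_a),
            filtered_sketch("users_b", filter_b),
        ],
        "postAggregations": [{
            "type": "thetaSketchEstimate",
            "name": "users_in_both",
            "field": {
                "type": "thetaSketchSetOp",
                "name": "users_in_both_sketch",
                "func": "INTERSECT",
                "fields": [
                    {"type": "fieldAccess", "fieldName": "users_a"},
                    {"type": "fieldAccess", "fieldName": "users_b"},
                ],
            },
        }],
    }


# Hypothetical datasource, sketch column, and dimensions.
theta_query = build_theta_intersection_query(
    "events",
    "2015-01-01/2016-01-01",
    "user_id_sketch",
    {"type": "selector", "dimension": "gender", "value": "male"},
    {"type": "selector", "dimension": "region", "value": "XYZ"},
)
print(json.dumps(theta_query, indent=2))
```

The estimate comes back as a double as well, so the same rounding on the client applies.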