Expected variation in hyperUnique counts

Hi,

What is the expected variation in Druid's hyperUnique counts compared against an exactly computed unique count?

Is there a way to adjust this error bound at the cost of performance?

My daily events would be in the range of 15-20 million, and the uniques would be under 2 million.

Hourly events would be around 1.5-2 million, and hourly uniques around 40-50k (uniques across the day will be less than the sum of the uniques across the hours).

Thanks

Manohar

Hi Manohar, have you read this blog post? http://druid.io/blog/2012/05/04/fast-cheap-and-98-right-cardinality-estimation-for-big-data.html

Thanks Charles,

Strangely, I seem to be getting a 2.9% variation on a relatively small data set. Not sure if I am doing something wrong.

"approximation:

  • Increasing the number of buckets (the k) increases the accuracy of the approximation
  • Increasing the number of bits of your hash increases the highest possible number you can accurately approximate""
    Is there a way to change these values as an end user of druid to get a better accuracy? My data set is relatively small under normal circumstances
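For reference, the relative standard error of a HyperLogLog sketch is roughly 1.04 / sqrt(m), where m is the number of buckets. A minimal Python sketch of that arithmetic, treating 2^11 = 2048 buckets as an assumed default for Druid's HyperLogLogCollector (an assumption, not something stated in this thread):

import math

def hll_relative_error(log2_buckets):
    """Approximate relative standard error of a HyperLogLog sketch
    with 2**log2_buckets buckets (the classic 1.04 / sqrt(m) bound)."""
    m = 2 ** log2_buckets
    return 1.04 / math.sqrt(m)

# Assumed default of 2**11 = 2048 buckets gives ~2.3% standard error.
for log2m in (10, 11, 12, 14, 16):
    print(f"2^{log2m} buckets -> ~{hll_relative_error(log2m):.2%} error")

At 2048 buckets the expected error is about 2.3%, so a single observed deviation of 2.9% is not by itself a sign of misconfiguration.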

If you’re normally using small datasets and need more accurate results, you may want to look at theta sketches: https://datasketches.github.io/, which are included in https://github.com/druid-io/druid/releases/tag/druid-0.8.3.

Hi,

You can find the docs for the Druid datasketches module at http://druid.io/docs/0.8.3/development/datasketches-aggregators.html. It allows you to make the trade-off between accuracy and sketch size; you can see the details on datasketches.github.io, as Charles pointed out.
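For later readers, a minimal sketch of what a thetaSketch aggregation spec from that module might look like, built here as a Python dict; the name, fieldName, and size values are illustrative assumptions, not taken from this thread:

import json

# Illustrative thetaSketch aggregator spec (the field names and the
# "size" value are hypothetical examples, not from this thread).
# "size" is the nominal number of entries kept by the sketch: a larger
# power of two gives better accuracy at the cost of a bigger sketch.
theta_sketch_agg = {
    "type": "thetaSketch",
    "name": "unique_users",
    "fieldName": "user_id",
    "size": 16384,
}

print(json.dumps(theta_sketch_agg, indent=2))

Increasing size (a power of two) trades a bigger sketch for lower error, which is the accuracy/size trade-off mentioned above.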

– Himanshu

Thanks,

I am not able to view this page, which I assume has the right configurations to use (aggregatorName, etc.):

http://druid.io/docs/latest/development/datasketches-aggregators.html

Thanks and Regards

Manohar

Manohar, try again. I fixed the problem with that bad link.