[druid-user] Druid HLL vs Datasketch HLL


Can someone explain DataSketches to me in plain English?

Which HLL is better, the Apache DataSketches one or Druid's HyperLogLogCollector?
It seems to me that Apache DataSketches is better in terms of accuracy, and that the size difference is negligible. Am I wrong?



The Druid version came first. DataSketches came out later and, in general, uses less space and performs better.
I would say you're right. Apache DataSketches is the recommended version; the Druid version is probably still around
for backwards compatibility (imo).

My understanding is that Druid has two HLL implementations:

  1. The hyperUnique aggregator - https://druid.apache.org/docs/latest/querying/hll-old.html. This is the older one. It is used by default when you do a count distinct without turning approximate count distinct off (the useApproximateCountDistinct query context setting), or when you use the approx_count_distinct function.
  2. The Apache DataSketches based HLL - https://druid.apache.org/docs/latest/development/extensions-core/datasketches-hll.html. This is used when you call approx_count_distinct_ds_hll.
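
To make the difference concrete, here is a rough sketch of how the two options above (plus an exact count for comparison) could be requested through Druid's SQL endpoint (POST /druid/v2/sql). It only builds the JSON request bodies and does not send them. The datasource name "events" and the column "user_id" are made up for illustration, and build_sql_payload is a hypothetical helper, not part of any Druid client. It also assumes the DataSketches extension is loaded, and which implementation the plain approx_count_distinct resolves to may depend on your Druid version and configuration.

```python
import json

# Hypothetical datasource and column names, used only for illustration.
DATASOURCE = "events"
COLUMN = "user_id"


def build_sql_payload(function_name, approximate=True):
    """Build a JSON-serializable request body for Druid's SQL endpoint.

    function_name is either the placeholder "COUNT_DISTINCT" (plain
    COUNT(DISTINCT ...)) or the name of an aggregation function such as
    APPROX_COUNT_DISTINCT or APPROX_COUNT_DISTINCT_DS_HLL.
    """
    if function_name == "COUNT_DISTINCT":
        expr = f'COUNT(DISTINCT "{COLUMN}")'
    else:
        expr = f'{function_name}("{COLUMN}")'
    return {
        "query": f'SELECT {expr} AS uniques FROM "{DATASOURCE}"',
        # Query context flag that controls whether COUNT(DISTINCT ...) is approximated.
        "context": {"useApproximateCountDistinct": approximate},
    }


payloads = {
    # Option 1 from the list above: the built-in approximate function.
    "builtin_hll": build_sql_payload("APPROX_COUNT_DISTINCT"),
    # Option 2 from the list above: the DataSketches HLL function.
    "datasketches_hll": build_sql_payload("APPROX_COUNT_DISTINCT_DS_HLL"),
    # Exact count, with approximation switched off, as a baseline.
    "exact": build_sql_payload("COUNT_DISTINCT", approximate=False),
}

for name, payload in payloads.items():
    print(name, json.dumps(payload))
```

Running all three against the same data and comparing the two approximate results with the exact one is a simple way to check the accuracy question for your own dataset.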