Can someone explain DataSketches to me in plain English?
Which HLL is better, Apache DataSketches or Druid's HyperLogLogCollector?
It seems to me that Apache DataSketches is better in terms of accuracy, and the size difference is negligible. Am I wrong?
The Druid version came first. DataSketches came out later, and in general it uses less space and performs better.
I would say you're right. Apache DataSketches is the recommended version; the Druid version is probably still around
for backwards compatibility (imo).
My understanding is that Druid has two HLL implementations:
- The hyperUnique aggregator - https://druid.apache.org/docs/latest/querying/hll-old.html. This is the older one. It is used by default when you do a count distinct without setting approximate count distinct to false, or when you use the approx_count_distinct function.
- The Apache DataSketches based HLL - https://druid.apache.org/docs/latest/development/extensions-core/datasketches-hll.html. This is used when you call approx_count_distinct_ds_hll.
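To make the difference between the two functions concrete, here is a minimal sketch of the request bodies you could POST to Druid's SQL endpoint (`/druid/v2/sql`) to compare them side by side. The datasource `events` and column `user_id` are hypothetical names, and the helper `build_sql_payload` is made up for illustration. It also assumes the druid-datasketches extension is loaded, since `APPROX_COUNT_DISTINCT_DS_HLL` needs it. The code only builds the payloads and does not contact a cluster.

```python
import json

# Hypothetical datasource and column names, for illustration only.
DATASOURCE = "events"
COLUMN = "user_id"


def build_sql_payload(function_name, use_approximate=True):
    """Build the JSON body for a POST to Druid's SQL endpoint (/druid/v2/sql).

    function_name: the SQL aggregation function to call, e.g.
        APPROX_COUNT_DISTINCT (older built-in HLL) or
        APPROX_COUNT_DISTINCT_DS_HLL (DataSketches HLL).
    use_approximate: value for the useApproximateCountDistinct query
        context parameter, which controls how plain COUNT(DISTINCT ...) is planned.
    """
    sql = f'SELECT {function_name}("{COLUMN}") AS uniques FROM "{DATASOURCE}"'
    return {
        "query": sql,
        "context": {"useApproximateCountDistinct": use_approximate},
    }


# One payload per implementation, so the results can be compared.
builtin_hll = build_sql_payload("APPROX_COUNT_DISTINCT")
datasketches_hll = build_sql_payload("APPROX_COUNT_DISTINCT_DS_HLL")

print(json.dumps(builtin_hll, indent=2))
print(json.dumps(datasketches_hll, indent=2))
```

Running both queries against the same datasource and comparing `uniques` with an exact count (approximation turned off) is one way to check the accuracy question for your own data.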