Feeding Druid's HyperLogLog result into HyperLogLog libraries in python

Hi,

I am trying to get a serialized version of Druid’s HyperLogLog object through the query response by setting {“context”: {“finalize”: false}}. This gets me a base64 encoded HyperLogLog object. Now I want to take this serialized version then deserialize it into a java object. Once I have a java object, I want to copy the object into a python HyperLogLog object. There are several libraries available for HyperLogLog in python. I just have to pick one. Lets say I pick this one - https://github.com/ascv/HyperLogLog. I am guessing if I can get the internal details of a HyperLogLog object in java, I can copy that over to python. How can I achieve what I want to do?

Thanks,

Arjun

Hey Arjun,

Check out HyperLogLogCollector in Druid for the class that handles HLLs. If you deserialize into that, then you can poke at its internals to get the state of the HLL object. But also note that the Druid HLL implementation has been tweaked a bit from the original paper (see http://druid.io/blog/2014/02/18/hyperloglog-optimizations-for-real-world-systems.html) so take care when trying to translate it to the data structure used by an off the shelf HLL lib.

Thanks Gian. I was able to just translate Druid’s HLL code to python and got it working. Is there a way I can remove items from a Druid HLL object and get the count for the resultant object? So for example, consider a HLL object that contains the items “a”, “b” and “c” and gives the count of 3. Can I remove “c” from the same object and expect it to now give a count of 2?

Thanks,

Arjun

HLL doesn’t retain enough information to be able to handle deletes. If you need to be able to do this, take a look at datasketches: http://druid.io/docs/latest/development/extensions-core/datasketches-aggregators.html. They take up somewhat more space but they can do set intersections and differences in addition to unions.

Hi Gian,

Since I have ported over Druid’s HLL implementation to python, I wanted to know what parameters I can tune to make it more accurate? Currently, I would like anything less than distinct count for 1000 objects to be accurate.

Thanks,

Arjun

Hey Arjun,

The main thing you can do is increase the number of registers, which will improve the accuracy. But HLL and its variants don’t guarantee exact counts for any amount of distinct values, even small ones.