Question: recommended pre-processing for hyperloglog/thetasketch measures

We have a batch ingestion pipeline in place that consists of an ETL app that pre-processes log files and a Druid Hadoop indexer that ingests the output from the ETL app.

I wonder whether it is necessary to pre-process fields within the ETL app that will become hyperloglog/thetasketches in Druid.
Our ETL app currently puts fields slated to be unique counts through a maximum-entropy hash. We started from an example that used md5 and have since switched to murmur for performance reasons, but I wonder whether this hashing can be skipped altogether.

Perhaps it would be best to illustrate my questions by means of a sample recordset with ten records that have a field named “uniqueField”:

uniqueField: xxxxxxxa, xxxxxxxb, xxxxxxxc, xxxxxxxd, xxxxxxxd, null, null, null, null, null

To my understanding, this should yield a unique count of 9: four distinct non-null values (xxxxxxxd appears in two records) plus five nulls, each counted as its own unique.

question 1) would a hyperloglog/thetasketch treat the null values as distinct entries? In my opinion it would make a lot of sense if it did, because it would remove the need to generate random ids for the null values in an ETL preprocessing step just to have Druid interpret them as uniques.

question 2) is it necessary or recommended to put the field through a maximum-entropy hash, and would doing so make the estimates more accurate? Without the hashing, the above values differ only in a single character. In my opinion it would be best if this kind of pre-processing weren’t necessary or were taken care of by Druid itself.

question 3) if hashing of uniques is recommended, which hash algorithm would suffice? We tried out md5 and murmur, but both are costly operations.


Both the hyperUniques and thetaSketch aggregators in Druid can ingest raw [string] input (no hashing etc. required). However, both of them ignore null values, so if you want to count each null as a separate unique value then you would have to do that on the ETL side.

– Himanshu

I mean for nulls, you would have to encode them into unique ids on the ETL side so that Druid counts them as uniques.
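As a rough illustration of that ETL step, here is a minimal Python sketch using the sample recordset from the question. The function name `fill_null_ids` and the `null-` prefix are hypothetical, not part of Druid or of the poster's ETL app, and the exact `set` counts only stand in for what a sketch would estimate. It assumes no real value of the field already starts with the chosen prefix.

```python
import uuid
from typing import Iterable, Iterator, Optional


def fill_null_ids(values: Iterable[Optional[str]],
                  prefix: str = "null-") -> Iterator[str]:
    """Replace each null with a fresh random id so it counts as its own unique.

    Hypothetical helper: name and prefix are illustrative assumptions.
    """
    for value in values:
        if value is None:
            # uuid4 is random, so every null gets a different id
            yield prefix + uuid.uuid4().hex
        else:
            # Non-null values pass through as raw strings, no hashing
            yield value


sample = ["xxxxxxxa", "xxxxxxxb", "xxxxxxxc", "xxxxxxxd", "xxxxxxxd",
          None, None, None, None, None]

encoded = list(fill_null_ids(sample))

# Exact distinct counts, used here only as a reference for the estimates
exact_with_nulls = len(set(encoded))
exact_ignoring_nulls = len({v for v in sample if v is not None})

print(exact_with_nulls, exact_ignoring_nulls)
```

With the nulls encoded this way the exact distinct count is 9; if the nulls are simply dropped, as the aggregators do with the raw field, it is 4.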

Cool, thanks a lot for these insights.

We generate random ids for null values, and thanks to you I now know that we can remove the md5 hashing, which will speed up our ingestion process.

thanks a lot Himanshu