We have a batch ingestion pipeline in place that consists of an ETL app that pre-processes log files and a Druid Hadoop indexer that ingests the output of the ETL app.
I wonder whether it is necessary to pre-process, within the ETL app, the fields that will become HyperLogLog/Theta sketch metrics in Druid.
Our ETL app currently puts fields slated to become unique counts through a maximum-entropy hash. We saw an example that used md5 and have since switched to murmur for performance reasons, but I wonder whether this hashing can be skipped altogether.
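For reference, a minimal sketch of the pre-hashing step described above. It uses Python's standard hashlib for md5 only because murmur is not in the standard library (in practice we use a native binding); the function name is made up for illustration:

```python
import hashlib

def prehash_unique_field(value: str) -> str:
    """Hash a unique-count field before ingestion.

    md5 stands in for whatever maximum-entropy hash the ETL uses;
    the production pipeline uses murmur via a third-party binding.
    """
    return hashlib.md5(value.encode("utf-8")).hexdigest()

print(prehash_unique_field("xxxxxxxa"))
```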
Perhaps it would be best to illustrate my questions by means of a sample recordset with ten records that have a field named "uniqueField":
uniqueField: xxxxxxxa, xxxxxxxb, xxxxxxxc, xxxxxxxd, xxxxxxxd, null, null, null, null, null
To my understanding, this should yield a unique count of 9: two records contain the same value (xxxxxxxd), and each of the five nulls is assumed to count as a distinct entry.
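To make that expectation concrete, here is a small sketch that computes the exact distinct count for the sample recordset under the interpretation assumed above, i.e. every null treated as its own distinct entity:

```python
# Sample recordset from above: four distinct values, one duplicate,
# and five nulls (None).
records = ["xxxxxxxa", "xxxxxxxb", "xxxxxxxc", "xxxxxxxd", "xxxxxxxd",
           None, None, None, None, None]

# Exact distinct count if every null counts as a separate entity:
# distinct non-null values plus one per null record.
non_null = [r for r in records if r is not None]
null_count = sum(1 for r in records if r is None)
distinct = len(set(non_null)) + null_count
print(distinct)  # 4 distinct non-null values + 5 nulls = 9
```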
question 1) would a HyperLogLog/Theta sketch treat the null values as distinct entries? In my opinion it would make a lot of sense if it did, because that would forego the need to generate random ids for the null values in an ETL pre-processing step just to have Druid interpret them as uniques.
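The ETL workaround mentioned here would look roughly like the following sketch (uuid4 is just one possible source of random ids; the helper name and the "null-" prefix are made up for illustration):

```python
import uuid

def fill_null_with_random_id(value):
    """Replace a null unique-count field with a random id so that a
    downstream sketch counts each null record as a separate unique."""
    return value if value is not None else "null-" + uuid.uuid4().hex

records = ["xxxxxxxa", "xxxxxxxb", "xxxxxxxc", "xxxxxxxd", "xxxxxxxd",
           None, None, None, None, None]
filled = [fill_null_with_random_id(r) for r in records]
print(len(set(filled)))  # 9: each null became its own unique value
```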
question 2) is it necessary or recommended to put the field through a maximum-entropy hash, and would doing so make the estimates more accurate? Without the hashing, the field values above differ only in a single character. In my opinion it would be best if this sort of pre-processing weren't necessary, or were taken care of by Druid itself.
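To illustrate the concern: the raw values differ in only their last character, whereas a maximum-entropy hash spreads that difference across the whole output. A quick sketch comparing bit-level differences, with md5 standing in for whatever hash the ETL uses:

```python
import hashlib

def bit_difference(a: str, b: str) -> int:
    """Count how many bits differ between the md5 digests of two strings."""
    ha = int.from_bytes(hashlib.md5(a.encode("utf-8")).digest(), "big")
    hb = int.from_bytes(hashlib.md5(b.encode("utf-8")).digest(), "big")
    return bin(ha ^ hb).count("1")

# The inputs differ in one character, but the 128-bit digests should
# differ in roughly half their bits.
print(bit_difference("xxxxxxxa", "xxxxxxxb"))
```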
question 3) IF hashing of uniques is recommended, which hash algorithm would suffice? We tried md5 and murmur, but both are costly operations.
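If cost is the main concern, something cheaper from the standard library such as zlib.crc32 could be tried, though whether a checksum's output distribution is good enough for accurate sketch estimates is exactly what I am unsure about:

```python
import zlib

def cheap_hash(value: str) -> int:
    """crc32 as a low-cost stand-in for md5/murmur.

    Caveat: crc32 is an error-detecting checksum, not designed for
    uniform output distribution, so its fitness for sketch accuracy
    is an open question here.
    """
    return zlib.crc32(value.encode("utf-8"))

for v in ("xxxxxxxa", "xxxxxxxb", "xxxxxxxc", "xxxxxxxd"):
    print(v, cheap_hash(v))
```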