hyperLogLog over "raw long text" vs hyperLogLog over "hashed text".

Hello,

I have a question about the best way to estimate the cardinality of a “long text” column.
We want to store “user queries” data (not for full-text search, don’t worry ^^U), and one question we’ll have to answer is how many different queries have been run in a certain period of time.

We normalize the text, mapping variations of the same question to one unique representation, but after this step we don’t know whether it’s better to store this normalized text directly in Druid, telling the DB that the column should be treated as a hyperLogLog metric, or to hash the normalized text first to avoid some potential problems.
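To make the “hash the normalized text” option concrete, here is a minimal sketch of what I mean (the normalization rule and the choice of SHA-1 are just illustrative assumptions, not our actual pipeline):

```python
import hashlib

def normalize(query: str) -> str:
    # Hypothetical normalization: lowercase and collapse whitespace.
    return " ".join(query.lower().split())

def query_key(query: str) -> str:
    # Hash the normalized text into a fixed-length key before ingestion,
    # so the column holds short uniform strings instead of long raw text.
    return hashlib.sha1(normalize(query).encode("utf-8")).hexdigest()

# Two surface variations of the same question map to the same key.
print(query_key("What is Druid?") == query_key("  what IS druid? "))  # True
```

The idea is that the cardinality estimate is unchanged (distinct normalized texts and distinct hashes coincide, up to negligible collision probability), while the stored values are bounded in size.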

Two related questions arise:

  • Are the raw events stored without data loss in the deep storage, or does the rollup process destroy “low-level” information?
  • Is it possible to store event IDs in Druid without treating them as dimensions or hyperLogLog metrics? In that case, is it worth assigning IDs to low-level events in Druid (in other words, is it possible to access this non-metric, non-dimension data in any way through Druid)?
Thank you in advance. Regards.

One “correction” to my question (*Is it possible to store event IDs in Druid without treating them as dimensions or hyperLogLog metrics?*):
I’m referring to the possibility of not treating the column as a dimension or as any type of metric (not limited to hyperLogLog, as I said in my previous question).

Again, thank you for your attention. Regards.

Does anyone around here know the answers to my questions? ^_^U (The most important one is whether or not to hash the “hyperUnique” metric columns.)

Thank you!

Hey acorrea,

Currently every column in Druid has to be either a dimension or a metric. The rollup process is basically doing:

SELECT dim1, dim2, dim3, AGG1(met1), AGG2(met2), AGG3(met3) FROM your_data GROUP BY dim1, dim2, dim3

The raw data is not stored. This process retains all data from your dimensions but is generally lossy for your metrics (as we’re only storing the aggregate metrics, not the raw metric values). So usually in your case people will choose to index the query field as a hyperUnique metric, assuming that approximate unique counts are OK.
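A toy illustration of why rollup is lossy for metrics (plain Python standing in for Druid, with made-up events; an exact set stands in for the approximate hyperUnique sketch):

```python
from collections import defaultdict

# Made-up raw events: (hour_bucket, country) dimensions plus a query field.
events = [
    ("2014-01-01T00", "US", "what is druid"),
    ("2014-01-01T00", "US", "what is druid"),
    ("2014-01-01T00", "US", "druid rollup"),
    ("2014-01-01T00", "DE", "what is druid"),
]

# Rollup: GROUP BY the dimensions, aggregate the metric column.
# A set gives exact unique counts here; Druid's hyperUnique keeps an
# approximate HLL sketch instead, so the raw query values are gone.
rolled = defaultdict(set)
for hour, country, query in events:
    rolled[(hour, country)].add(query)

for key, uniques in sorted(rolled.items()):
    print(key, len(uniques))
# Four raw rows collapse into two aggregated rows; the individual
# query strings can no longer be recovered from the stored segments.
```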