I have a question about the best way to estimate the cardinality of a “long text” column.
We want to store “user queries” data (not for full-text search, don’t worry ^^U), and one question we’ll have to answer is how many distinct queries have been made in a given period of time.
We normalize the text, mapping variations of the same question to a single representation, but after this step we don’t know whether it’s better to store this normalized text directly in Druid, telling the DB to treat the column as a HyperLogLog metric, or to hash the normalized text first to avoid some potential problems.
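For context, the hashing variant I have in mind is something like the following minimal Python sketch. The `normalize` function and the choice of SHA-1 are just placeholders, not our actual pipeline; the point is only that hashing gives the HLL column a fixed-length input regardless of how long the query text is:

```python
import hashlib

def normalize(query: str) -> str:
    # Placeholder normalization: lowercase and collapse whitespace.
    # (The real pipeline maps variations of the same question together.)
    return " ".join(query.lower().split())

def query_fingerprint(query: str) -> str:
    # Hash the normalized text to a fixed-length hex digest, so the
    # value fed to the HyperLogLog sketch has bounded size no matter
    # how long the original query is.
    return hashlib.sha1(normalize(query).encode("utf-8")).hexdigest()

# Two surface variations collapse to the same fingerprint:
a = query_fingerprint("How do I   reset my password?")
b = query_fingerprint("how do i reset my PASSWORD?")
assert a == b
```

The trade-off I’m unsure about is whether this extra hashing step buys anything, given that the HLL sketch already hashes its inputs internally.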
Two related questions arise:
- Are the raw events stored without data loss in deep storage, or does the rollup process destroy the “low-level” information?
- Is it possible to store event IDs in Druid without treating them as either dimensions or HyperLogLog metrics? If so, is it worth assigning IDs to low-level events (in other words, can this non-metric, non-dimension data be accessed in any way through Druid)?
Thank you in advance. Regards.