We have a use case where we have to report exact distinct counts.
patient_id, diagnosis_id, prescription_id, hospital_id…
We have to compute exact distinct counts across (groupby) dimensions, ex: geography, hospital type, department…
Assuming that resources is not a constraint (CPU, RAM). We are using latest version of druid, 0.15.0. And the following exercises were tried using SQL and not native json request.
We tried the following:
Disabled approximate count (useApproximateCountDistinct=false). I am getting exact count, but as documented we can use only one count(distinct col1) in a query. But we need to do counts on a number of columns/dimensions. So this is not a viable option.
we used APPROX_COUNT_DISTINCT_DS_THETA() on each of the columns (patient_id, diagnosis_id, prescription_id, hospital_id) and compared with count(distinct) (multiple runs, as only one count distinct allowed). The counts are exactly matching. We tried various combinations to validate the data, and it is exactly matching.
So the question is, is the DS thetha algorithm so accurate, it is giving exact precise value.
Is there a possibility, for some other combination (which we havent tried), we may not get exact count?
Are there other options to get exact distinct counts in a single query on multiple columns/dimensions?