Exact distinct count

Hi,

We have a use case where we have to report exact distinct counts.

ex data:

patient_id, diagnosis_id, prescription_id, hospital_id…

We have to compute exact distinct counts across (groupby) dimensions, ex: geography, hospital type, department…

Assuming that resources is not a constraint (CPU, RAM). We are using latest version of druid, 0.15.0. And the following exercises were tried using SQL and not native json request.

We tried the following:

  1. Disabled approximate count (useApproximateCountDistinct=false). I am getting exact count, but as documented we can use only one count(distinct col1) in a query. But we need to do counts on a number of columns/dimensions. So this is not a viable option.

  2. we used APPROX_COUNT_DISTINCT_DS_THETA() on each of the columns (patient_id, diagnosis_id, prescription_id, hospital_id) and compared with count(distinct) (multiple runs, as only one count distinct allowed). The counts are exactly matching. We tried various combinations to validate the data, and it is exactly matching.

So the question is, is the DS thetha algorithm so accurate, it is giving exact precise value.

Is there a possibility, for some other combination (which we havent tried), we may not get exact count?

Are there other options to get exact distinct counts in a single query on multiple columns/dimensions?

Regards, Chari.

I believe that when the dimension’s cardinality is less than the size of the theta sketch it return exact counts. After that it returns approximations.

https://datasketches.github.io/docs/Theta/KMVupdateVkth.html

Hi Steve,

Thank you.

I read the documentation again, now it is clear for me.

To increase the accuracy of approximation, we can use parameters (Theta: Size, HLL: lgK, tgtHllType).

Iam using SQL.

with parameter: druid.sql.planer.useApproximateCountDistinct=true; the following query executes fine.

select ReportingCategory, Store_CountryName,

count(distinct DW_EK_Product), APPROX_COUNT_DISTINCT_DS_THETADW_EK_Product),

APPROX_COUNT_DISTINCT_DS_HLL(DW_EK_Product)

from retail_data_rollup_index_hyper group by ReportingCategory, Store_CountryName

But with paramter: druid.sql.planer.useApproximateCountDistinct=false; the same query throws exception stating can’t build plan.

I know that only one count(distinct) can be used in an SQL, with above parameter to false. I was of the opinion that i can use any many number of DS function calls on other columns.

I did not find any documentation on this restriction. Is there a limit/restriction?

Regards, Chari.