Count distinct question

Hi guys,

I have two Kafka streaming datasources that consume the same Kafka topic, and I want to count distinct "VID" values. I tried both cardinality and hyperUnique, but the results differ a lot.

This is my config:

a:

"metricsSpec": [
  {
    "type": "count",
    "name": "count"
  },
  {
    "type": "hyperUnique",
    "name": "unique_uid",
    "fieldName": "UID",
    "isInputHyperUnique": false,
    "round": false
  },
  {
    "type": "hyperUnique",
    "name": "unique_vid",
    "fieldName": "VID",
    "isInputHyperUnique": false,
    "round": false
  }
],
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": "MINUTE",
  "rollup": true,
  "intervals": null
},

b:

"metricsSpec": [
  {
    "type": "count",
    "name": "count"
  }
],
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": "MINUTE",
  "rollup": true,
  "intervals": null
},

a query:

SELECT
  paramId,
  APPROX_COUNT_DISTINCT(unique_vid)
FROM "a"
WHERE type IN ('appDownloadNew') AND paramId IN ('2301') AND paramType = 'appDetail'
  AND __time >= '2020-05-17T09:10:02.378+08:00' -- and __time<='2020-05-18T09:10:02.378+08:00'
GROUP BY para

count = 327

b query:

SELECT
  paramId,
  APPROX_COUNT_DISTINCT(VID)
FROM "olap-client-logs-minutely"
WHERE type IN ('appDownloadNew') AND paramId IN ('2301') AND paramType = 'appDetail'
  AND __time >= '2020-05-17T09:10:02.378+08:00' -- and __time<='2020-05-18T09:10:02.378+08:00'
GROUP BY paramId

count = 11328


Hey,
Your metricsSpec seems to be missing (or off). Can you please send the correct metricsSpec?

As a side note, in case you're not already aware of it: for approximate count distinct, it's advised to use the DataSketches aggregators (see https://druid.apache.org/docs/latest/querying/aggregations.html#cardinality-hyperunique).
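For reference, here is a rough sketch of what a DataSketches-based metricsSpec fragment could look like for the VID column. It assumes the druid-datasketches extension is loaded on the cluster. The metric name unique_vid_hll is made up for illustration, and lgK and tgtHllType are set to what I believe are the documented defaults, so treat it as a starting point rather than a verified config.

```json
"metricsSpec": [
  {
    "type": "count",
    "name": "count"
  },
  {
    "type": "HLLSketchBuild",
    "name": "unique_vid_hll",
    "fieldName": "VID",
    "lgK": 12,
    "tgtHllType": "HLL_4"
  }
]
```

At query time, a column like this can be read in SQL with APPROX_COUNT_DISTINCT_DS_HLL(unique_vid_hll). The same function also accepts a regular column such as VID, so both datasources could be compared using the same estimator.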

Thanks,

Itai

Thanks, I'll check the DataSketches API.

On Monday, May 18, 2020 at 11:47:35 AM UTC+8, 俞小勇 wrote: