Broker not returning an accurate approximate count for unique values (theta sketch)

Hi Experts,

I am using Druid 0.10.0.

I have implemented DataSketches (theta sketches) to get unique counts, unions, and so on.

But when I query the broker, it returns only about 50% of the true count. An approximate count should be at least 95% accurate.

But when I query a historical node directly, it gives a proper approximate count (95%-100% of the true value).
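(For reference: Druid's thetaSketch aggregator defaults to 16384 nominal entries, and a theta sketch's relative standard error is roughly 1/sqrt(16384) ≈ 0.8%, so an estimate that is ~50% low is far outside normal sketch error and points to a merging or ingestion problem.)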

I am creating more than one segment per hour and using 10 historical nodes to store them.

My data is about 1.3B rows and 200 GB per day.

Something goes wrong at the broker node. How does the broker merge the partial results from the historicals to produce the query response?

Please help me figure this out; my whole query structure depends on it.

Thanks,

Jitesh

Here is my Broker runtime.properties

druid.host=xxx.xx.xx.xxx
druid.service=druid/broker
druid.port=8082

# Jetty server
druid.server.http.numThreads=20
druid.server.http.maxIdleTime=PT60M

# Retry
druid.broker.retryPolicy.numTries=2

# Processing
druid.processing.buffer.sizeBytes=2140000000
druid.processing.formatString=processing_%s
druid.processing.numThreads=7
druid.processing.numMergeBuffers=3

# Query configuration
druid.query.groupBy.maxIntermediateRows=1073741824
druid.query.groupBy.maxResults=1073741824
druid.broker.http.readTimeout=PT60M
druid.query.groupBy.maxOnDiskStorage=1073741824
druid.query.search.maxSearchLimit=1000000

# Caching
#druid.broker.cache.useCache=true
#druid.broker.cache.populateCache=true

druid.zk.service.host=xxx.xx.xx.xxx:2181
druid.discovery.curator.path=/druid/discNew
druid.zk.paths.base=/druid

druid.extensions.loadList=["druid-kafka-eight", "druid-s3-extensions", "sqlserver-metadata-storage", "druid-distinctcount", "druid-datasketches", "scan-query"]
druid.extensions.directory=/opt/druid-0.10.0/extensions/
druid.extensions.hadoopDependenciesDir=/opt/druid-0.10.0/hadoop-dependencies/

druid.startup.logging.logProperties=true
druid.monitoring.monitors=["com.metamx.metrics.JvmMonitor"]
druid.emitter=LoggingEmitter
druid.emitter.logging.logLevel=info

druid.sql.enable=true
druid.sql.avatica.maxConnections=500
druid.sql.avatica.maxStatementsPerConnection=50
druid.sql.planner.maxQueryCount=0
druid.sql.planner.maxSemiJoinRowsInMemory=1000000
druid.sql.planner.maxTopNLimit=1000000

druid.processing.columnCache.sizeBytes=1300000000

Hey Jitesh,

Could you share some more details about the tests you’ve done (queries, results, etc)?

Hi Gian,

Thank you for reviewing this.

Query:

{
  "queryType": "timeseries",
  "dataSource": "user_event_analytics",
  "granularity": "all",
  "filter": {
    "type": "and",
    "fields": [
      {
        "type": "selector",
        "dimension": "location",
        "value": "l1"
      }
    ]
  },
  "aggregations": [
    {
      "type": "longSum",
      "name": "numz_count",
      "fieldName": "numz"
    },
    {
      "type": "thetaSketch",
      "name": "unique_user_sketch",
      "fieldName": "unique_user_sketch"
    }
  ],
  "intervals": [
    "2017-07-04T00:00:00.000Z/2017-07-04T10:00:00.000Z"
  ]
}

Result:

{
  "version": "v1",
  "timestamp": "2017-07-05T04:00:00.000Z",
  "event": {
    "numz_count": 1200,
    "unique_user_sketch": 124
  }
}

We could query with the distinctCount aggregator instead, but we can't use it for UNION, INTERSECT, NOT, etc.

The unique counts are much too low (50-55% accuracy) when I query the broker, but when I query a historical node directly I see ~99% accuracy.

That's why UNION, INTERSECT, and NOT queries fail, and retention analysis queries along with them.
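For reference, set operations over theta sketches are expressed as a thetaSketchSetOp post-aggregation over sketch aggregators. Below is a minimal sketch of an INTERSECT query against the datasource above; the second location value "l2" and the output names are hypothetical:

{
  "queryType": "timeseries",
  "dataSource": "user_event_analytics",
  "granularity": "all",
  "aggregations": [
    {
      "type": "filtered",
      "filter": { "type": "selector", "dimension": "location", "value": "l1" },
      "aggregator": { "type": "thetaSketch", "name": "users_l1", "fieldName": "unique_user_sketch" }
    },
    {
      "type": "filtered",
      "filter": { "type": "selector", "dimension": "location", "value": "l2" },
      "aggregator": { "type": "thetaSketch", "name": "users_l2", "fieldName": "unique_user_sketch" }
    }
  ],
  "postAggregations": [
    {
      "type": "thetaSketchEstimate",
      "name": "users_in_both",
      "field": {
        "type": "thetaSketchSetOp",
        "name": "users_in_both_sketch",
        "func": "INTERSECT",
        "fields": [
          { "type": "fieldAccess", "fieldName": "users_l1" },
          { "type": "fieldAccess", "fieldName": "users_l2" }
        ]
      }
    }
  ],
  "intervals": ["2017-07-04T00:00:00.000Z/2017-07-04T10:00:00.000Z"]
}

The same structure with "func": "UNION" or "NOT" covers the other set operations.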

I am ingesting a thetaSketch metric like: {"type": "thetaSketch", "name": "unique_user_sketch", "fieldName": "uid"}

I am ingesting uid with the default theta sketch size.
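In case a larger sketch helps accuracy, the size can be set explicitly in the metricsSpec at ingestion time (it must be a power of 2; 16384 is the default). A minimal sketch, reusing the field names above; 65536 is just an example value and the longSum metric is assumed from the queries in this thread:

"metricsSpec": [
  { "type": "longSum", "name": "numz", "fieldName": "numz" },
  { "type": "thetaSketch", "name": "unique_user_sketch", "fieldName": "uid", "size": 65536 }
]

If the size is raised at ingestion, it is recommended to pass at least the same size in the query-time thetaSketch aggregator as well.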

When I query the broker it does not return the expected approximate count; whatever the historical nodes are producing is not being merged correctly.

How does the broker handle unique counts and theta sketch operations across multiple historical nodes and multiple segments?
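One way to see what the broker is actually merging is to disable finalization through the query context, so the response carries the serialized merged sketch instead of the final estimate; comparing that against a direct historical query can show where the merge goes wrong. A minimal sketch:

{
  "queryType": "timeseries",
  "dataSource": "user_event_analytics",
  "granularity": "all",
  "aggregations": [
    { "type": "thetaSketch", "name": "unique_user_sketch", "fieldName": "unique_user_sketch" }
  ],
  "intervals": ["2017-07-04T00:00:00.000Z/2017-07-04T10:00:00.000Z"],
  "context": { "finalize": false }
}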

Thanks,

Jitesh

Hi Gian,

Here are my historical runtime.properties

Historical Config:

druid.host=xxx.xx.xx.xxx
druid.service=druid/historical
druid.port=80

# Storage
druid.server.maxSize=500000000000
druid.server.tier=tier_0_user

# Segment cache
druid.segmentCache.locations=[{"path": "/analytics_data/segCache/", "maxSize": 500000000000}]
#druid.segmentCache.deleteOnRemove
druid.segmentCache.infoDir=/analytics_data/segInfo/

# Jetty server
druid.server.http.numThreads=30
druid.server.http.maxIdleTime=PT10m

# Processing
druid.processing.buffer.sizeBytes=2147483647
druid.processing.formatString=processing_%s
druid.processing.numThreads=15

# Query configuration
druid.query.groupBy.maxIntermediateRows=1073741824
druid.query.groupBy.maxResults=1073741824
druid.query.search.maxSearchLimit=1000000

# Caching
druid.historical.cache.useCache=true
druid.historical.cache.populateCache=true

druid.zk.service.host=xxx.xx.xx.xxx:2181
druid.discovery.curator.path=/druid/discNew
druid.zk.paths.base=/druid

druid.extensions.loadList=["druid-kafka-eight", "druid-s3-extensions", "sqlserver-metadata-storage", "druid-distinctcount", "druid-datasketches", "scan-query"]
druid.extensions.directory=/opt/druid-0.10.0/extensions/
druid.extensions.hadoopDependenciesDir=/opt/druid-0.10.0/hadoop-dependencies/

druid.startup.logging.logProperties=true
druid.monitoring.monitors=["com.metamx.metrics.JvmMonitor"]
druid.emitter=LoggingEmitter
druid.emitter.logging.logLevel=info

druid.query.groupBy.maxOnDiskStorage=100000000

The historical node is also giving a lower count for the thetaSketch unique than for distinctCount.

Can you please help me with this problem?

Query sent directly to a historical node:

{
  "queryType": "timeseries",
  "dataSource": "user_event_analytics",
  "granularity": "all",
  "filter": {
    "type": "and",
    "fields": [
      {
        "type": "selector",
        "dimension": "location",
        "value": "l1"
      }
    ]
  },
  "aggregations": [
    {
      "type": "longSum",
      "name": "numz_count",
      "fieldName": "numz"
    },
    {
      "type": "thetaSketch",
      "name": "unique_user_sketch",
      "fieldName": "unique_user_sketch"
    },
    {
      "type": "distinctCount",
      "name": "distinctCount",
      "fieldName": "uid"
    }
  ],
  "intervals": [
    "2017-07-04T01:00:00.000Z/2017-07-05T02:00:00.000Z"
  ]
}

Result:

{
  "timestamp": "2017-07-04T01:00:00.000Z",
  "result": {
    "distinctCount": 6270,
    "numz_count": 52404,
    "unique_user_sketch": 5445
  }
}

Thanks,

Jitesh