The results of Druid native query and Druid sql query don't match

I am using Druid 0.13.0-incubating. I used kafka indexing ingestion created a datasource. The indexing file defined “ts_card_hashed_client_id” as dimension, timestampSpe=ts_transaction_timestamp; metrics has count, doubleSum of “ts_amount” field, segmentGranularity using DAY:
{

“type”: “kafka”,

“dataSchema”: {

“dataSource”: “test”,

“parser”: {

“type”: “string”,

“parseSpec”: {

“format”: “json”,

“timestampSpec”: {

“column”: “ts_transaction_timestamp”,

“format”: “auto”

},

“dimensionsSpec”: { “dimensions”: [ “ts_card_hashed_client_id” ]}}},

“metricsSpec” : [

{

“name”: “count”,

“type”: “count”

},

{

“type” : “doubleSum”,

{

“type” : “doubleSum”,

“name” : “Total_Sales”,

“fieldName” : “ts_amount”

}

],

“granularitySpec”: {

“type”: “uniform”,

“segmentGranularity”: “DAY”,

“queryGranularity”: “NONE”,

“intervals” : [“2017-02-01/2018-12-31”],

“rollup”: false

It successfully ingested 300K messages from kafka. Coordinator shows the datasource and segments.

I wrote two query files in druid native query and sql:

  1. dsql.json

{

“queryType”: “timeseries”,

“dataSource”: “test”,

“granularity”: “month”,

“descending”: “true”,

“aggregations”: [

{“type”: “count”, “name”: “num_transactions”},

{“type”: “doubleSum”, “name”: “sum_transaction_amount”, “fieldName”: “Total_Sales”},

{“type”: “count”, “name”: “num_clients”, “fieldName”: “ts_card_hashed_client_id”},

{“type”: “distinctCount”, “name”: “num_distinct_clients”, “fieldName”: “ts_card_hashed_client_id”}

],

“intervals”: [“2017-02-01T00:00:00.000/2018-12-31T00:00:00.000”]

}

  1. sql.json

{

“query”:“SELECT FLOOR(__time to MONTH), COUNT(*) AS num_transactions , SUM(Total_Sales) AS sum_transaction_amount, COUNT(ts_card_hashed_client_id) AS num_clients, COUNT(DISTINCT ts_card_hashed_client_id) AS num_distinct_clients FROM test GROUP BY FLOOR(__time to MONTH) ORDER BY FLOOR(__time to MONTH) DESC”

}

I’ve noticed the response of num_clients and num_distinct_clients from both queries aren’t the same:

output from dsql.json

{

“timestamp” : “2018-12-01T00:00:00.000Z”,

“result” : {

“num_distinct_clients” : 5306,

“num_transactions” : 5387,

“num_clients” : 5387,

“sum_transaction_amount” : 42223.1

}

output from sql.json

{

“EXPR$0”:“2018-12-01T00:00:00.000Z”,

“num_transactions”:5387,

“sum_transaction_amount”:42223.1,

“num_clients”:5334,

“num_distinct_clients”:4806

}

Can someone help to explain?

Many thanks,

Christine

Hey Christine,

I haven’t looked in detail at your queries but the quickest way to figure out what’s causing the discrepancy might be to run your sql query with “EXPLAIN PLAN FOR” at the front, Druid will return the native JSON query that it executes - you can then compare this with your own json query to check for differences or share the results if you’d like some help.

Best regards,

Dylan

Answer my own question. sql query by default has druid.sql.planner.useApproximateCountDistinct = true
still need to understand a bit more about distinctCount via native query though.