Druid HyperUnique Totals

I am currently using Druid 0.8.3.

**Question 1:** Should the Druid HyperUnique totals match the Cardinality totals? I have found that the values don’t match.

Druid Hadoop-based Batch Ingestion JSON:

```
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "special_report-V1",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "csv",
          "columns": ["eventDate", "dim1", "dim2", "dim3", "dim4", "dim5"],
          "timestampSpec": {
            "column": "eventDate",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": ["dim2"],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [
        { "type": "count", "name": "count" },
        { "type": "hyperUnique", "name": "dim2_hyper_unique_count", "fieldName": "dim2" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "HOUR",
        "intervals": ["2016-05-19T10:00:00.000Z/2016-05-19T12:00:00.000Z"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "granularity",
        "dataGranularity": "HOUR",
        "inputPath": "/tmp/special-reports",
        "filePattern": ".*.csv"
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "targetPartitionSize": 0
      }
    }
  }
}
```

Druid Query JSON:

```
{
  "queryType": "groupBy",
  "dataSource": "special_report-V1",
  "granularity": "day",
  "dimensions": ["dim1"],
  "aggregations": [
    { "type": "cardinality", "name": "dim2_CardinalityCount", "fieldNames": ["dim2"], "byRow": true },
    { "type": "hyperUnique", "name": "dim2_HyperUniqueCount", "fieldName": "dim2_hyper_unique_count" },
    { "type": "count", "name": "count" }
  ],
  "postAggregations": [],
  "intervals": ["2016-05-19T10:00:00.000Z/2016-05-19T12:00:00.000Z"]
}
```
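For reference, a side-by-side check with the grouping dimension removed can help isolate the gap. Below is a sketch (reusing the datasource and metric names from the specs above) of a `timeseries` query over the whole interval; `byRow` is set to `false` here so the cardinality aggregator hashes individual `dim2` values rather than grouped rows:

```
{
  "queryType": "timeseries",
  "dataSource": "special_report-V1",
  "granularity": "all",
  "aggregations": [
    { "type": "cardinality", "name": "dim2_CardinalityCount", "fieldNames": ["dim2"], "byRow": false },
    { "type": "hyperUnique", "name": "dim2_HyperUniqueCount", "fieldName": "dim2_hyper_unique_count" }
  ],
  "intervals": ["2016-05-19T10:00:00.000Z/2016-05-19T12:00:00.000Z"]
}
```

Even in this form the two numbers may not agree exactly, since each is an independent HyperLogLog estimate.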

**Question 2:** Should the Druid HyperUnique totals differ depending on whether the related column is declared as a dimension? With the same input data for each batch ingestion, I have found that the values don’t match.

Druid Hadoop-based Batch Ingestion JSON (with "dim2" as a dimension):

```
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "special_report-V1",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "csv",
          "columns": ["eventDate", "dim1", "dim2", "dim3", "dim4", "dim5"],
          "timestampSpec": {
            "column": "eventDate",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": ["dim2"],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [
        { "type": "count", "name": "count" },
        { "type": "hyperUnique", "name": "dim2_hyper_unique_count", "fieldName": "dim2" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "HOUR",
        "intervals": ["2016-05-19T10:00:00.000Z/2016-05-19T12:00:00.000Z"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "granularity",
        "dataGranularity": "HOUR",
        "inputPath": "/tmp/special-reports",
        "filePattern": ".*.csv"
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "targetPartitionSize": 0
      }
    }
  }
}
```

Druid Hadoop-based Batch Ingestion JSON (without "dim2" as a dimension):

```
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "special_report-V1",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "csv",
          "columns": ["eventDate", "dim1", "dim2", "dim3", "dim4", "dim5"],
          "timestampSpec": {
            "column": "eventDate",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [
        { "type": "count", "name": "count" },
        { "type": "hyperUnique", "name": "dim2_hyper_unique_count", "fieldName": "dim2" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "HOUR",
        "intervals": ["2016-05-19T10:00:00.000Z/2016-05-19T12:00:00.000Z"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "granularity",
        "dataGranularity": "HOUR",
        "inputPath": "/tmp/special-reports",
        "filePattern": ".*.csv"
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "targetPartitionSize": 0
      }
    }
  }
}
```

Druid Query JSON:

```
{
  "queryType": "groupBy",
  "dataSource": "special_report-V1",
  "granularity": "day",
  "dimensions": ["dim1"],
  "aggregations": [
    { "type": "hyperUnique", "name": "dim2_HyperUniqueCount", "fieldName": "dim2_hyper_unique_count" },
    { "type": "count", "name": "count" }
  ],
  "postAggregations": [],
  "intervals": ["2016-05-19T10:00:00.000Z/2016-05-19T12:00:00.000Z"]
}
```

No, the values do not necessarily match. `hyperUnique` ingests the dimension as a metric, building a HyperLogLog sketch at indexing time; `cardinality` reads the stored dimension values at query time and uses HLL to estimate uniques. Both are approximations computed along different code paths, so their totals can diverge.
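Where `dim2` is stored as a dimension, an exact baseline for both estimates can be obtained by grouping on `dim2` itself and counting the rows returned. A sketch, reusing the names above (practical only while the number of distinct values is small):

```
{
  "queryType": "groupBy",
  "dataSource": "special_report-V1",
  "granularity": "all",
  "dimensions": ["dim2"],
  "aggregations": [{ "type": "count", "name": "count" }],
  "intervals": ["2016-05-19T10:00:00.000Z/2016-05-19T12:00:00.000Z"]
}
```

The number of result rows is the exact distinct count. On the datasource ingested without `dim2` in `dimensionsSpec`, neither this query nor a `cardinality` aggregator has a dimension column to read, so the pre-aggregated `hyperUnique` metric is the only way to count uniques there.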