How do I utilize the HyperUnique aggregator?

I am not quite sure how to link the “hyperUnique” metric to the computation of cardinality. Has anyone had experience with this and been able to provide an example? Any help appreciated.

From the documentation here: http://druid.io/docs/latest/querying/aggregations.html#cardinality-aggregator

Computes the cardinality of a set of Druid dimensions, using HyperLogLog to estimate the cardinality. Please note that this aggregator will be much slower than indexing a column with the hyperUnique aggregator.
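Both aggregators are built on HyperLogLog. For intuition about why the result is an estimate rather than an exact count, here is a minimal, self-contained sketch of the basic HLL idea in Python. This is only an illustration of the technique; Druid's actual implementation is a tuned variant with its own corrections and serialized sketch format.

```python
import hashlib
import math

def hll_estimate(items, p=14):
    """Estimate the number of distinct items with a basic HyperLogLog.

    p controls accuracy: m = 2**p registers give a standard error of
    roughly 1.04 / sqrt(m) (~0.8% for p = 14).
    """
    m = 1 << p
    registers = [0] * m
    for item in items:
        # Derive a 64-bit hash of the item.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - p)                # first p bits choose a register
        rest = h & ((1 << (64 - p)) - 1)   # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)       # bias-correction constant for large m
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    # Small-range correction: fall back to linear counting while many
    # registers are still empty.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw
```

Feeding 10,000 distinct values through this sketch returns an estimate within about one percent of 10,000, which is the same trade-off the Druid aggregators make: a small, mergeable data structure in exchange for an approximate answer.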

From the documentation here: http://druid.io/docs/0.8.3/querying/aggregations.html#hyperunique-aggregator

Uses HyperLogLog to compute the estimated cardinality of a dimension that has been aggregated as a “hyperUnique” metric at indexing time.

{ "type" : "hyperUnique", "name" : <output_name>, "fieldName" : <metric_name> }

Below is a quick outline of my current thinking. Thoughts?

Druid Hadoop-based Batch Ingestion JSON:

{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "special_report-V1",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "csv",
          "columns" : ["dim1","dim2","dim3","dim4","dim5"],
          "timestampSpec": {
            "column": "msgDate",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": ["dim1","dim2","dim3","dim4","dim5"],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec" : [{"type": "count", "name": "count"}, { "type" : "hyperUnique", "name" : "dim2_count", "fieldName" : "dim2" }],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "HOUR",
        "queryGranularity" : "HOUR",
        "intervals": ["2016-05-19T10:00:00.000Z/2016-05-19T12:00:00.000Z"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "granularity",
        "dataGranularity": "HOUR",
        "inputPath": "/tmp/special-reports",
        "filePattern": ".*.csv"
      }
    },
    "tuningConfig": {
        "type": "hadoop",
        "partitionsSpec": {
          "targetPartitionSize": 0
        }
    }
  }
}

**Druid Query JSON:**

{
  "queryType": "groupBy",
  "dataSource": "special_report-V1",
  "granularity": "day",
  "dimensions": ["dim1"],
  "aggregations": [
    { "type": "cardinality", "name": "dim2Count1", "fieldNames": ["dim2"], "byRow": false },
    { "type": "cardinality", "name": "dim2Count2", "fieldNames": ["dim2_count"], "byRow": false },
    { "type": "count", "name": "count" }
  ],
  "postAggregations": [],
  "intervals": ["2016-05-19T10:00:00.000Z/2016-05-19T12:00:00.000Z"]
}


Hey Mark,

If you use a hyperUnique at ingestion time, you should use a hyperUnique at query time too. At query time, “hyperUnique” works on columns created with “hyperUnique” and “cardinality” works on regular string columns.
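Concretely, the two query-time forms look like the fragments below (column names assume the ingestion spec above, where "dim2_count" was ingested as a hyperUnique metric and "dim2" is a plain string dimension):

{ "type": "hyperUnique", "name": "dim2Uniques", "fieldName": "dim2_count" }

{ "type": "cardinality", "name": "dim2Uniques", "fieldNames": ["dim2"], "byRow": false }

The first reads the pre-built HLL sketch column; the second builds the sketch from the raw dimension at query time, which is why the docs warn it is much slower.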

So if I wish to utilize the “hyperUnique” aggregator, does the following make sense based on my previous Druid Ingestion JSON?

Druid Ingestion JSON Snippet:

"metricsSpec" : [{"type": "count", "name": "count"}, { "type" : "hyperUnique", "name" : "dim2_count", "fieldName" : "dim2" }]

**Druid Query JSON:**

{
  "queryType": "groupBy",
  "dataSource": "special_report-V1",
  "granularity": "day",
  "dimensions": ["dim1"],
  "aggregations": [
    { "type": "hyperUnique", "name": "dim2_HyperUniqueCount", "fieldNames": ["dim2_count"], "byRow": false },
    { "type": "count", "name": "count" }
  ],
  "postAggregations": [],
  "intervals": ["2016-05-19T10:00:00.000Z/2016-05-19T12:00:00.000Z"]
}

Thanks @Gian for your input.

Unfortunately, when I run my updated Druid indexing spec and the query below, I get a hyperUnique value of 0. Any suggestions?

I found the following helpful: https://groups.google.com/forum/#!msg/druid-user/DrAGNRUTtEg/kA39sCstBAAJ

Hi Mark,

Can you try changing:

{ "type": "hyperUnique", "name": "dim2_HyperUniqueCount", "fieldNames": ["dim2_count"], "byRow": false },

to:

{ "type": "hyperUnique", "name": "dim2_HyperUniqueCount", "fieldName": "dim2_count", "byRow": false },

The hyperUnique aggregator only accepts a single field, hence "fieldName" rather than "fieldNames".

Thanks,

Jon

Also, "byRow" isn't a field in the hyperUnique aggregator.

Thanks, it worked! I missed the subtle "fieldName" declaration difference between this aggregator and the others (http://druid.io/docs/latest/querying/aggregations.html). Nice catch.

My updated Query JSON is given below.

Druid Ingestion JSON Snippet:

"metricsSpec" : [{"type": "count", "name": "count"}, { "type" : "hyperUnique", "name" : "dim2_count", "fieldName" : "dim2" }]

**Druid Query JSON:**

{
  "queryType": "groupBy",
  "dataSource": "special_report-V1",
  "granularity": "day",
  "dimensions": ["dim1"],
  "aggregations": [
    { "type": "hyperUnique", "name": "dim2_HyperUniqueCount", "fieldName": "dim2_count" },
    { "type": "count", "name": "count" }
  ],
  "postAggregations": [],
  "intervals": ["2016-05-19T10:00:00.000Z/2016-05-19T12:00:00.000Z"]
}
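As a follow-up for anyone extending this: the empty "postAggregations" block can combine the hyperUnique metric with other aggregates via Druid's hyperUniqueCardinality and arithmetic post-aggregators. A sketch (the output name "uniques_per_event" is illustrative, not from the thread) that would compute the ratio of unique dim2 values to total rows per group:

"postAggregations": [
  {
    "type": "arithmetic",
    "name": "uniques_per_event",
    "fn": "/",
    "fields": [
      { "type": "hyperUniqueCardinality", "fieldName": "dim2_HyperUniqueCount" },
      { "type": "fieldAccess", "fieldName": "count" }
    ]
  }
]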

For those reading this post, I thought I would also include a link to a helpful article on aggregations: https://theza.ch/2015/04/05/introduction-to-indexing-aggregation-and-querying-in-druid/ .