hyperUnique gives different results across multiple executions of the same query

When using the hyperUnique aggregator, I get different values on repeated executions of the same query.

Query:

{
  "queryType": "groupBy",
  "dataSource": "actives_data_source",
  "granularity": {
    "type": "period",
    "period": "P1D",
    "origin": "2017-12-01T10:00:00.000Z"
  },
  "dimensions": ["state", "location_class"],
  "intervals": "2017-12-11T10:00:00.000Z/2017-12-15T10:00:00.000Z",
  "aggregations": [
    {"type": "hyperUnique", "name": "actives", "fieldName": "user_id"}
  ]
}

ResultSet:

Execution Attempt 1:

{
  "version": "v1",
  "timestamp": "2017-12-11T10:00:00.000Z",
  "event": {
    "location_class": "1L",
    "state": "Bihar",
    "actives": 19396.80653321458
  }
}

Execution Attempt 2:

{
  "version": "v1",
  "timestamp": "2017-12-11T10:00:00.000Z",
  "event": {
    "location_class": "1L",
    "state": "Bihar",
    "actives": 19389.034881761647
  }
}

Execution Attempt 3:

{
  "version": "v1",
  "timestamp": "2017-12-11T10:00:00.000Z",
  "event": {
    "location_class": "1L",
    "state": "Bihar",
    "actives": 19392.92087779155
  }
}

On every execution I get one of the three "actives" values shown above, seemingly at random.

I observed the same behaviour on the following Druid versions:

0.8.3

0.11.0

Has anyone else observed this behaviour, and if so, what was the root cause?

HyperUnique uses HyperLogLog, which estimates cardinality rather than computing it exactly. Perhaps your result changes depending on the order in which results from the historicals are merged together. Since this is an estimate, there is a margin of error, which may be what you are seeing here.

Kyle

Thanks Kyle. Is there any way to enforce a merge order so that we get consistent results?

You could try enabling broker query caching for groupBy queries, though that might be dangerous. I also have no idea whether this would actually affect the consistency of your results.

Kyle

You can try making the groupBy merge single-threaded; with a single-threaded groupBy, the merge order may be the same across consecutive executions.
The relevant property is "druid.query.groupBy.singleThreaded" in runtime.properties, or "groupByIsSingleThreaded" when set in the query context.

Docs: http://druid.io/docs/latest/querying/groupbyquery.html

Please note that it can affect performance negatively.
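
For example, the context flag could be added to the query from the first post roughly like this. This is only a minimal Python sketch: the helper name with_single_threaded_groupby is made up for illustration, and actually sending the JSON to your broker is left out.

```python
import json


def with_single_threaded_groupby(query: dict) -> dict:
    """Return a copy of the query with groupByIsSingleThreaded set in its context.

    Hypothetical helper for illustration; the original query dict is not modified.
    """
    patched = dict(query)
    context = dict(query.get("context", {}))
    context["groupByIsSingleThreaded"] = True
    patched["context"] = context
    return patched


# The groupBy query from the first post, unchanged.
query = {
    "queryType": "groupBy",
    "dataSource": "actives_data_source",
    "granularity": {
        "type": "period",
        "period": "P1D",
        "origin": "2017-12-01T10:00:00.000Z",
    },
    "dimensions": ["state", "location_class"],
    "intervals": "2017-12-11T10:00:00.000Z/2017-12-15T10:00:00.000Z",
    "aggregations": [
        {"type": "hyperUnique", "name": "actives", "fieldName": "user_id"}
    ],
}

patched_query = with_single_threaded_groupby(query)
body = json.dumps(patched_query, indent=2)  # JSON body to submit to the broker
print(body)
```

Setting it in the context only affects that one query, so it is an easy way to test the effect before changing "druid.query.groupBy.singleThreaded" for the whole cluster.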

Thanks Nishant. Running it single-threaded worked.

Glad it worked. Just keep an eye on the performance.

We are seeing the same behavior: hyperUnique results differ across multiple executions of the same query.

Unfortunately, "groupByIsSingleThreaded" does not work for us. We tried both the v1 and v2 groupBy strategies, but neither helps.

We understand that hyperUnique is an approximate estimate, but we wonder whether it is deterministic.

We also noticed that querying a single segment always gives deterministic output. It looks like this behavior occurs only when HLL sketches from multiple segments are unioned.
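
For comparison, in a textbook HyperLogLog the union of two sketches is a register-wise max, which does not depend on merge order. Below is a minimal Python sketch of that. It is NOT Druid's implementation: the register count, the SHA-1 hash, and the three fake "segments" are all assumptions made for illustration, so it says nothing definite about how Druid merges its own HLL objects.

```python
import hashlib
import math

P = 11        # index bits; arbitrary choice for this sketch, not Druid's setting
M = 1 << P    # number of registers


def _hash64(value: str) -> int:
    """64-bit hash taken from SHA-1 (an assumption; any good hash would do)."""
    return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")


def new_sketch() -> list:
    return [0] * M


def add(sketch: list, value: str) -> None:
    h = _hash64(value)
    idx = h >> (64 - P)                    # top P bits select the register
    rest = h & ((1 << (64 - P)) - 1)       # remaining 64 - P bits
    rank = (64 - P) - rest.bit_length() + 1  # position of the leftmost 1-bit
    sketch[idx] = max(sketch[idx], rank)


def merge(a: list, b: list) -> list:
    """Union of two sketches: register-wise max."""
    return [max(x, y) for x, y in zip(a, b)]


def estimate(sketch: list) -> float:
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)     # small-range correction
    return raw


# Three fake "segments" with overlapping user ids (18000 distinct ids in total).
segments = []
for start in (0, 5000, 10000):
    s = new_sketch()
    for uid in range(start, start + 8000):
        add(s, f"user-{uid}")
    segments.append(s)

a, b, c = segments
est_abc = estimate(merge(merge(a, b), c))
est_cba = estimate(merge(c, merge(b, a)))
print(est_abc, est_cba)
```

In this simplified model both merge orders produce identical registers and therefore the same estimate. If the results in this thread really do depend on merge order, the cause would have to be something specific to Druid's HLL representation or merge path rather than HyperLogLog as such, but that is a guess and not something confirmed here.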