[GroupBy] [Multi-value dimension + Single dimension] Multiple records in result for the same group of dimensions on realtime

Hi folks,

I have a strange situation with a groupBy on 2 dimensions, one of which is multi-value:

  • multiValDim (ingested with ["iphone"] or ["web"], or not present at all)
  • stringDim (ingested with values "1" or "2")

The following groupBy query:

{
  "queryType": "groupBy",
  "dataSource": "test-datasource",
  "intervals": "2800-01-01T01:00:00.000Z/2800-01-01T02:00:00.000Z",
  "dimensions": ["multiValDim", "stringDim"],
  "aggregations": [
    {
      "type": "longSum",
      "name": "events",
      "fieldName": "events"
    }
  ],
  "limit": 2147483647,
  "granularity": "all"
}

produces the following output:

[
  {
    "version": "v1",
    "timestamp": "2800-01-01T01:00:00.000Z",
    "event": {
      "stringDim": "2",
      "events": 1
    }
  },
  {
    "version": "v1",
    "timestamp": "2800-01-01T01:00:00.000Z",
    "event": {
      "stringDim": "1",
      "events": 510
    }
  },
  {
    "version": "v1",
    "timestamp": "2800-01-01T01:00:00.000Z",
    "event": {
      "stringDim": "2",
      "events": 1
    }
  },
  {
    "version": "v1",
    "timestamp": "2800-01-01T01:00:00.000Z",
    "event": {
      "stringDim": "1",
      "events": 490
    }
  },
  {
    "version": "v1",
    "timestamp": "2800-01-01T01:00:00.000Z",
    "event": {
      "multiValDim": "iphone",
      "stringDim": "1",
      "events": 10001
    }
  },
  {
    "version": "v1",
    "timestamp": "2800-01-01T01:00:00.000Z",
    "event": {
      "multiValDim": "web",
      "stringDim": "1",
      "events": 1001
    }
  }
]

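The problem is that for stringDim + multiValDim combinations where multiValDim is missing (not set), I get more than one group per combination in the result (see the first four records, which have no multiValDim: stringDim "2" and stringDim "1" each appear twice).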
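
To confirm the duplication programmatically, a small check can group the result rows by their dimension values and flag any key that appears more than once. This is only a minimal sketch in Python, assuming the response has the v1 groupBy shape shown above; the function name `find_duplicate_groups` is made up for illustration, the sample rows are trimmed from the output above, and a dimension missing from a row is treated as `None`.

```python
from collections import defaultdict

def find_duplicate_groups(rows, dimensions):
    """Return {dimension-tuple: [metric dicts]} for keys seen more than once."""
    seen = defaultdict(list)
    for row in rows:
        event = row["event"]
        # A dimension absent from the event is represented as None in the key
        key = tuple(event.get(dim) for dim in dimensions)
        metrics = {k: v for k, v in event.items() if k not in dimensions}
        seen[key].append(metrics)
    return {key: vals for key, vals in seen.items() if len(vals) > 1}

# Rows trimmed from the query output above (version/timestamp omitted)
rows = [
    {"event": {"stringDim": "2", "events": 1}},
    {"event": {"stringDim": "1", "events": 510}},
    {"event": {"stringDim": "2", "events": 1}},
    {"event": {"stringDim": "1", "events": 490}},
    {"event": {"multiValDim": "iphone", "stringDim": "1", "events": 10001}},
    {"event": {"multiValDim": "web", "stringDim": "1", "events": 1001}},
]

duplicates = find_duplicate_groups(rows, ["multiValDim", "stringDim"])
for key, metrics in duplicates.items():
    print(key, metrics)
```

Running a check like this against each response makes it easier to tell when the intermittent duplication has actually occurred.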

The problem happens only on realtime.

After the handoff it does not reproduce.

Additional context and observations:

  • multiValDim has either 1 value or is not set at all.
  • While the data was in realtime, the segment metadata reported hasMultipleValues=true for multiValDim, but after handoff it became hasMultipleValues=false. I am not sure this is related to the problem. My guess is that during handoff Druid sees that the field never holds more than one value and sets hasMultipleValues to false.
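
The hasMultipleValues flag can be compared across segments with a `segmentMetadata` query run with `"merge": false`. A rough sketch in Python follows, assuming the response is a list of per-segment objects, each with an `id` and a `columns` map. The helper names and the sample response (including the segment ids) are invented for illustration and do not come from a real cluster.

```python
def build_segment_metadata_query(datasource, interval):
    """Build a segmentMetadata query body (to be sent to the broker as JSON)."""
    return {
        "queryType": "segmentMetadata",
        "dataSource": datasource,
        "intervals": [interval],
        "merge": False,  # keep one result entry per segment
    }

def multi_value_flags(response, column):
    """Map segment id -> hasMultipleValues for the column (None if absent)."""
    flags = {}
    for segment in response:
        info = segment.get("columns", {}).get(column)
        flags[segment["id"]] = None if info is None else info.get("hasMultipleValues")
    return flags

query = build_segment_metadata_query(
    "test-datasource",
    "2800-01-01T01:00:00.000Z/2800-01-01T02:00:00.000Z",
)

# Hypothetical response for illustration only; the segment ids are invented
sample_response = [
    {"id": "segment-a", "columns": {"multiValDim": {"type": "STRING", "hasMultipleValues": True}}},
    {"id": "segment-b", "columns": {"multiValDim": {"type": "STRING", "hasMultipleValues": False}}},
    {"id": "segment-c", "columns": {"stringDim": {"type": "STRING", "hasMultipleValues": False}}},
]
flags = multi_value_flags(sample_response, "multiValDim")
print(flags)
```

If the two realtime partitions report different flags for multiValDim, or one of them lacks the column entirely, that would be worth including in the report.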

Here is a screenshot. Note that I have 2 groups each for zoneId "3" and zoneId "5". How is that possible?

As a reminder, all the queried data is in 2 realtime partitions (not handed off yet). The problem reproduces only intermittently, so I have to push thousands of records, randomly with and without the multi-value dimension set, to trigger it.
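
For anyone trying to reproduce this, the randomized input can be sketched roughly as below. This is an assumption-laden example, not my actual producer: the `timestamp` field name, the flat JSON event shape, the `events` value of 1, and the equal probabilities are all hypothetical. Only the dimension names and their possible values come from the description above.

```python
import json
import random

def make_records(count, seed=0, timestamp="2800-01-01T01:30:00.000Z"):
    """Generate events where multiValDim is ["iphone"], ["web"], or absent."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    records = []
    for _ in range(count):
        record = {
            "timestamp": timestamp,  # assumed field name, inside the queried interval
            "stringDim": rng.choice(["1", "2"]),
            "events": 1,
        }
        platform = rng.choice(["iphone", "web", None])
        if platform is not None:
            # Single-element list, matching how the dimension is ingested
            record["multiValDim"] = [platform]
        records.append(record)
    return records

records = make_records(5000)
with_dim = sum(1 for r in records if "multiValDim" in r)
print(with_dim, len(records) - with_dim)
print(json.dumps(records[0]))
```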