Druid theta-sketches module returns wrong results

I have a Spark job that builds Parquet files containing theta sketches as raw bytes.
I then ingest this data into Druid via Hadoop.

For most queries the results are pretty close, but others are badly wrong.

What is interesting is that the total looks fine.

When I run a groupBy query filtered by the problematic app_id (abc) and split further by country, I get the following results:

{
  "version": "v1",
  "timestamp": "2019-07-06T00:00:00.000Z",
  "event": {
    "ltv_country": "IN",
    "users_count": 10045608,
    "users_count_approx": 4096.0,
    "events_count": 87055162,
    "app_id": "abc"
  }
},
{
  "version": "v1",
  "timestamp": "2019-07-06T00:00:00.000Z",
  "event": {
    "ltv_country": "ID",
    "users_count": 1400582,
    "users_count_approx": 4096.0,
    "events_count": 13800394,
    "app_id": "abc"
  }
},
{
  "version": "v1",
  "timestamp": "2019-07-06T00:00:00.000Z",
  "event": {
    "ltv_country": "RU",
    "users_count": 198689,
    "users_count_approx": 4096.0,
    "events_count": 1823982,
    "app_id": "abc"
  }
}
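One detail in these rows is worth spelling out: every users_count_approx is exactly 4096.0, which is the "size" configured for the thetaSketch metric. In a theta sketch the number of retained hashes is capped at the nominal size k, and the estimate is the retained count divided by theta. The toy sketch below is not the DataSketches implementation and makes no claim about the actual cause here; it only illustrates, under that simplified model, why an estimate equal to k looks like a sketch whose theta is being read as 1.0. The class and function names are made up for the illustration.

```python
import hashlib
import heapq

MAX_HASH = float(2 ** 64)


def hash01(item):
    """Map an item to a pseudo-uniform float in [0, 1) using 64 bits of SHA-256."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / MAX_HASH


class ToyThetaSketch:
    """Simplified KMV-style theta sketch (illustration only, not DataSketches)."""

    def __init__(self, k):
        self.k = k          # nominal size, like "size": 4096 in the metricsSpec
        self.theta = 1.0    # sampling threshold; 1.0 means exact mode
        self._heap = []     # max-heap of retained hashes (stored negated)
        self._seen = set()  # retained hashes, for de-duplication

    def update(self, item):
        h = hash01(item)
        if h >= self.theta or h in self._seen:
            return
        self._seen.add(h)
        heapq.heappush(self._heap, -h)
        if len(self._heap) > self.k:
            # Evict the largest hash; it becomes the new threshold theta.
            largest = -heapq.heappop(self._heap)
            self._seen.discard(largest)
            self.theta = largest

    def retained(self):
        return len(self._heap)

    def estimate(self):
        # Correct estimate: retained entries scaled by the sampling rate.
        return self.retained() / self.theta

    def estimate_ignoring_theta(self):
        # What the result looks like if theta is treated as 1.0.
        return float(self.retained())


if __name__ == "__main__":
    sketch = ToyThetaSketch(k=4096)
    for user_id in range(100_000):
        sketch.update(user_id)
    print("retained:", sketch.retained())
    print("theta:", sketch.theta)
    print("estimate:", sketch.estimate())
    print("ignoring theta:", sketch.estimate_ignoring_theta())
```

With far more distinct ids than k, the retained count sits at k, so any code path that reports the retained count instead of retained / theta returns exactly the configured size, regardless of the true cardinality.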

You might think there is something wrong with the data, but topN queries work just fine:

{
  "users_count": 12015316,
  "users_count_approx": 1.2113636560511928E7,
  "events_count": 106109552,
  "app_id": "abc"
},
{
  "users_count": 10179750,
  "users_count_approx": 1.0393982229777606E7,
  "events_count": 103264874,
  "app_id": "other_app"
}
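To put "just fine" into numbers, the short calculation below compares the topN estimates with the exact counts and with a rule-of-thumb relative standard error of roughly 1/sqrt(k) for a theta sketch of nominal size k. That rule of thumb is an approximation I am assuming here, not something taken from the Druid output, and the helper names are made up for the illustration.

```python
import math


def relative_error(estimate, exact):
    """Absolute relative error of an estimate against the exact count."""
    return abs(estimate - exact) / exact


def approx_rse(k):
    """Rule-of-thumb relative standard error for a theta sketch of size k (assumption)."""
    return 1.0 / math.sqrt(k)


# (exact users_count, users_count_approx) copied from the query results in this post.
topn_rows = {
    "abc": (12015316, 1.2113636560511928E7),
    "other_app": (10179750, 1.0393982229777606E7),
}
groupby_rows = {
    "abc/IN": (10045608, 4096.0),
    "abc/ID": (1400582, 4096.0),
    "abc/RU": (198689, 4096.0),
}

if __name__ == "__main__":
    rse = approx_rse(4096)
    print(f"expected RSE for k=4096: {rse:.2%}")
    for name, (exact, approx) in {**topn_rows, **groupby_rows}.items():
        err = relative_error(approx, exact)
        print(f"{name}: error {err:.2%} ({err / rse:.1f} x RSE)")
```

The topN errors come out at roughly 0.8% and 2.1%, which is in line with an expected error of about 1.6% for size 4096, while the groupBy rows are off by 98% or more.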

And with Spark, I am able to calculate correct results, so data files are fine.

There are no errors during ingestion or at query time.

Side note, for this specific data source it is fine to groupBy and sum count distinct values.

Thoughts?

A simplified Spark job example:

dataset.groupBy("app_id", "ltv_country").agg(countDistinct("id"), theta("id")).write.parquet("s3://…")

Simplified ingestion spec:

"type": "index_hadoop",
"spec": {
  "ioConfig": {
    "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat"
  }
},
"parser": {
  "type": "parquet",
  "parseSpec": {
    "format": "timeAndDims",
    "columns": [
      "timestamp",
      "app_id",
      "ltv_country",
      "id"
    ],
    "dimensionsSpec": {
      "dimensions": [
        "app_id"
      ]
    },
    "timestampSpec": {
      "format": "auto",
      "column": "timestamp"
    }
  }
},
"metricsSpec": [
  {
    "type": "thetaSketch",
    "name": "users_count_approx",
    "fieldName": "id_theta",
    "isInputThetaSketch": true,
    "size": 4096
  },
  {
    "type": "longSum",
    "name": "users_count",
    "fieldName": "id_distinct"
  }
]

I am using the same version of data-sketches: 0.13.3.

This looks like a known issue, fixed by this PR: https://github.com/apache/incubator-druid/pull/7666
The fix should be part of the 0.16.0 release.

In the meantime, you may want to manually replace sketches-core-0.13.3.jar with sketches-core-0.13.4.jar.

It fixed the issue, thank you.