Druid Theta Sketch estimates are wrong when computing sketches in Spark

Hi,

I am computing Theta sketches in Spark using the DataSketches (sketches-core) library, and I am getting the correct results when I print the estimates.
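
For context, the computation is along these lines (a simplified, minimal sketch of the approach rather than my exact Spark job; it builds a Theta sketch with the sketches-core defaults and prints the estimate):

import com.yahoo.sketches.theta.CompactSketch;
import com.yahoo.sketches.theta.UpdateSketch;

public class SketchEstimateExample {
    public static void main(String[] args) {
        // Build an update sketch with the library defaults
        // (the default nominal entries in sketches-core is 4096).
        UpdateSketch sketch = UpdateSketch.builder().build();

        // Feed it distinct values, e.g. user ids from a Spark partition.
        for (int i = 0; i < 100000; i++) {
            sketch.update("user-" + i);
        }

        // Compact the sketch for storage and print the estimate.
        CompactSketch compact = sketch.compact();
        System.out.println(compact.getEstimate());
    }
}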

Now I want to import these daily computed sketches from Spark into Druid, so that I can then run queries over large time ranges in Druid.

I imported these sketches by specifying "isInputThetaSketch": true in the indexing file. But when I run a query on my datasource, my estimates always top out at 4096. My guess is that I also have to specify the sketch size while indexing; I didn't explicitly specify the sketch size anywhere, neither in my Spark code nor while ingesting. Any idea how to solve this issue?

Thanks.

You can check here to see what sizes you can use: https://datasketches.github.io/docs/Theta/ThetaSize.html
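
As a rough rule of thumb (a general property of Theta sketches, not specific to any version in this thread): with nominal entries k, the relative standard error of the estimate in estimation mode is about 1 / sqrt(k). So k = 4096 gives roughly 1.6% error, while k = 524288 gives roughly 0.14%.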

Rommel Garcia

Hi Rommel,

I have checked that link before. When I print the sketch (without specifying nominal entries explicitly) in Spark, I get the following result:

DirectCompactOrderedSketch SUMMARY:

Estimate : 10494.730936959068
Upper Bound, 95% conf : 10755.755191550894
Lower Bound, 95% conf : 10239.986700986667
Theta (double) : 0.39029109222564295
Theta (long) : 3599799946267503701
Theta (long) hex : 31f50efa83beac55
EstMode? : true
Empty? : false
Array Size Entries : 4096
Retained Entries : 4096
Seed Hash : 93cc

END SKETCH SUMMARY

After explicitly setting the nominal entries with UpdateSketch.builder().setNominalEntries(524288).build(), I still got the same result:

DirectCompactOrderedSketch SUMMARY:

Estimate : 10494.730936959068
Upper Bound, 95% conf : 10755.755191550894
Lower Bound, 95% conf : 10239.986700986667
Theta (double) : 0.39029109222564295
Theta (long) : 3599799946267503701
Theta (long) hex : 31f50efa83beac55
EstMode? : true
Empty? : false
Array Size Entries : 4096
Retained Entries : 4096
Seed Hash : 93cc

END SKETCH SUMMARY

However, I suspect that my results are capped at 4096 when I query in Druid because Array Size Entries and Retained Entries are 4096. I could be wrong here. Please suggest what I should do to get proper results in Druid; when I print in Spark, I get proper estimates.

Himanshu,

I haven't used Spark with the sketches before, but try increasing the array size entries. Let me know how it goes.

Rommel Garcia

In my reply above, I explicitly tried to set the nominal entries, hoping that the array size and retained entries would increase, but for some reason that did not happen.
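
One thing I am now wondering about (an assumption on my part; the merge step of my job is not shown here): in sketches-core the default nominal entries value is 4096, and a Union has its own nominal entries setting, independent of the UpdateSketch builder. If the per-partition sketches are merged in Spark through a Union built with defaults, the merged result is capped at 4096 retained entries no matter what the individual sketches were built with. Setting it on the merge side would look something like this:

import com.yahoo.sketches.theta.CompactSketch;
import com.yahoo.sketches.theta.SetOperation;
import com.yahoo.sketches.theta.Union;

// The nominal entries must also be set on the Union builder;
// otherwise the union falls back to the library default of 4096.
Union union = SetOperation.builder().setNominalEntries(524288).buildUnion();
union.update(sketchA); // sketchA/sketchB stand in for sketches computed elsewhere
union.update(sketchB);
CompactSketch merged = union.getResult();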

Please always specify versions when reporting such things.
Thank you.

Apologies for not mentioning the versions.

Here are the versions I am using:

Druid: 0.14.0-incubating

Spark: 2.4.0 (Scala 2.11)

DataSketches (sketches-core): 0.10.0

Do your estimates or nominal entries appear to be wrong?
Could you print a sketch that seems to be wrong?

I am not saying that the sketches are wrong; rather, I am not getting the results I expect. This is a sample sketch printed by my Spark application:

DirectCompactOrderedSketch SUMMARY:

Estimate : 10494.730936959068
Upper Bound, 95% conf : 10755.755191550894
Lower Bound, 95% conf : 10239.986700986667
Theta (double) : 0.39029109222564295
Theta (long) : 3599799946267503701
Theta (long) hex : 31f50efa83beac55
EstMode? : true
Empty? : false
Array Size Entries : 4096
Retained Entries : 4096
Seed Hash : 93cc

END SKETCH SUMMARY

Here the estimate is right, but when I ingest this data into Druid and query it, I get the result 4096 (if the actual estimate is less than 4096 I get the right estimate; whenever it is more than 4096 the result is capped at 4096).

I suspect that the sketch summary printed by my Spark application having Array Size Entries and Retained Entries of 4096 is why I am getting a maximum estimate of 4096 in Druid. I am actually very new to both Druid and DataSketches, so my knowledge is very limited and my assumptions may be naive here.

Please suggest why I am not getting proper estimates when querying the data from Druid.

When ingesting the data, are you setting the nominal entries value? Can you post your ingestion spec?

No, I am not setting the nominal entries value or sketch size anywhere. Below is my ingestion spec:

{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "spark29may",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "tsv",
          "columns": [
            "ts",
            "seg",
            "thetaS"
          ],
          "dimensionsSpec": {
            "dimensions": [
              {
                "name": "seg",
                "type": "long"
              }
            ]
          },
          "timestampSpec": {
            "column": "ts",
            "format": "posix"
          }
        }
      },
      "metricsSpec": [
        {
          "name": "uniq_user",
          "type": "thetaSketch",
          "fieldName": "thetaS",
          "isInputThetaSketch": true
        }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "day",
        "queryGranularity": "minute",
        "rollup": true
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "/root/himanshu/",
        "filter": "thetaSketchResult29May"
      },
      "appendToExisting": false
    },
    "tuningConfig": {
      "type": "index",
      "maxRowsPerSegment": 5000000,
      "maxRowsInMemory": 25000,
      "forceExtendableShardSpecs": true
    }
  }
}

In the above spec, "thetaSketchResult29May" is the file containing the Theta sketches (the "thetaS" column) computed by my Spark application using the DataSketches library.

Would it be possible to try with the latest version of Druid?
https://github.com/apache/incubator-druid/releases/tag/druid-0.14.2-incubating

Can you please try setting the size? Refer to https://druid.apache.org/docs/latest/development/extensions-core/datasketches-theta.html
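
Concretely, "setting the size" means something like this in the metricsSpec (a hedged example: size must be a power of two, and 524288 here just mirrors the nominal entries value used earlier in this thread):

"metricsSpec": [
  {
    "name": "uniq_user",
    "type": "thetaSketch",
    "fieldName": "thetaS",
    "isInputThetaSketch": true,
    "size": 524288
  }
]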

Hi Anuj,

I tried setting the size in the ingestion spec as well as in my query JSON, but the results were the same.
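
For reference, the query-time aggregator I tried was along these lines (a reconstruction rather than the exact query; the interval here is a guess based on the datasource name):

{
  "queryType": "groupBy",
  "dataSource": "spark29may",
  "granularity": "all",
  "dimensions": ["seg"],
  "aggregations": [
    {
      "type": "thetaSketch",
      "name": "uniq_user",
      "fieldName": "uniq_user",
      "size": 524288
    }
  ],
  "intervals": ["2019-05-29/2019-05-30"]
}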

Hi Alexander,

Is there any known issue related to this in the specific version I am using?

Hey Alexander, I used druid-0.14.2-incubating. My results are still capped at 4096.

I also tried a timeseries query and the results come out fine, but things go wrong when I run groupBy queries.

Is there any known issue related to this in the specific version I am using?

I am not sure. It superficially looks similar to https://github.com/apache/incubator-druid/issues/7607, which was fixed in 0.14.2-incubating, but I don't believe that issue was present in 0.14.0-incubating.

We need some way of reproducing the issue. Unfortunately, I don't see a sketch-to-string debug printing post-aggregator for Theta sketches; I was under the impression we had one. I will fix this shortly.

Perhaps someone from the Druid development team could help reproduce this issue. With the previous issue above, Gian Merlino was very helpful.

Would it be possible to try the latest release candidate 0.15.0-incubating-rc2?

Hi Himanshu,

This certainly does sound like a bug, and on the surface a very similar issue to https://github.com/apache/incubator-druid/issues/7607 that Alexander Saydakov linked in this thread, which we believe was fixed by https://github.com/apache/incubator-druid/pull/7619 and went out in 0.14.2. However, there was another fix, https://github.com/apache/incubator-druid/pull/7666, which did not make it into 0.14.2 (or even 0.15.0, because it was merged after the branch was cut) and which might be relevant. It might be worth updating the datasketches jar in the extension folder to 0.13.4+ to see if that solves your issue; all you should have to do is replace the version that shipped with 0.14.2 and give that a try. I suspect the version you use with Spark would need to match - I'm not certain of this, but it would be nice to keep moving parts to a minimum.

If that doesn't work, would you be willing to help us produce a set of data that triggers the issue, to expedite fixing it as quickly as possible? Maybe something along the lines of what the reporter did in issue 7607, described in this comment: https://github.com/apache/incubator-druid/issues/7607#issuecomment-490337462

Thanks,

Clint

Hi Clint,

Here is how you can reproduce this issue.

Below is the Java code to generate the sample data:

package com.abc;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;

public class TestDataGenerator {

    public static void main(String[] args) {
        // Writes a TSV file with two keys (123 and 1234), each paired with
        // 1,000,000 distinct values, so each key has exactly 1,000,000 uniques.
        try (BufferedWriter bw = new BufferedWriter(new FileWriter(new File("/tmp/testData")))) {
            for (int i = 0; i < 1000000; i++) {
                bw.write("123\t" + i + "\n");
                bw.write("1234\t" + i + "\n");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

With that data, each of the two keys has 1,000,000 distinct values, well above the 4096 cap I am seeing. Below are the two classes I used to compute the Theta sketch in Spark:

ThetaSketchJavaSerializable.java

package com.abc;

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

import com.yahoo.memory.Memory;
import com.yahoo.sketches.theta.CompactSketch;
import com.yahoo.sketches.theta.Sketch;
import com.yahoo.sketches.theta.Sketches;
import com.yahoo.sketches.theta.UpdateSketch;

public class ThetaSketchJavaSerializable implements Serializable {

  /**
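
The rest of this class was cut off above. As a stand-in, here is a minimal sketch of what such a Serializable Theta sketch wrapper typically looks like; since the original body did not come through, treat the method names and the compact-on-write serialization below as illustrative assumptions rather than the exact code that was used:

package com.abc;

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

import com.yahoo.memory.Memory;
import com.yahoo.sketches.theta.Sketch;
import com.yahoo.sketches.theta.Sketches;
import com.yahoo.sketches.theta.UpdateSketch;

public class ThetaSketchJavaSerializable implements Serializable {

    // The wrapped sketch; transient because Sketch itself is not Serializable.
    private transient Sketch sketch;

    public ThetaSketchJavaSerializable() {
        sketch = UpdateSketch.builder().build();
    }

    public void update(String value) {
        if (sketch instanceof UpdateSketch) {
            ((UpdateSketch) sketch).update(value);
        }
    }

    public double getEstimate() {
        return sketch.getEstimate();
    }

    public Sketch getSketch() {
        return sketch;
    }

    // Serialize the compact, ordered form of the sketch.
    private void writeObject(ObjectOutputStream out) throws IOException {
        byte[] bytes = (sketch instanceof UpdateSketch)
            ? ((UpdateSketch) sketch).compact().toByteArray()
            : sketch.toByteArray();
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Deserialize by wrapping the bytes back into a read-only sketch.
    private void readObject(ObjectInputStream in) throws IOException {
        int length = in.readInt();
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        sketch = Sketches.wrapSketch(Memory.wrap(bytes));
    }
}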

Hi Clint,

If I have to update the sketches-core jar to 0.13.4+, which jars do I have to update in the Druid 0.14.2-incubating extensions/druid-datasketches folder?

These are the jars currently present in the druid-datasketches extension folder:

commons-math3-3.6.1.jar

druid-datasketches-0.14.2-incubating.jar

memory-0.12.2.jar

sketches-core-0.13.3.jar

slf4j-api-1.6.4.jar

Will just updating the sketches-core jar do, or do I need to do anything extra?