Getting ThetaSketches out of Druid

Hi,

Let's say I index a field in my source using thetaSketch. When I run a query on this field, is it possible to get the thetaSketch itself back in the Druid response? The idea is to load the returned sketch into the DataSketches library and perform further computations.

Thanks,
Arjun

You can pass "finalize": false in your query context, which prevents the sketches from being collapsed into numbers. You should get back something base64-encoded that should be loadable into DataSketches in some way.

Btw, you may also be able to do the computations you want using post-aggregators. Druid includes DataSketches postaggs for set intersection, union, and difference.
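For illustration, here is a minimal sketch of where the "finalize": false setting sits in a native query body. The helper class, the datasource name, the column name, and the interval are all hypothetical; only the overall query shape (a timeseries query with a thetaSketch aggregator and a context map) is taken from Druid's native query format, and you would still need to POST the resulting string to your broker.

```java
// Hypothetical helper (not part of Druid): builds the JSON body of a native
// timeseries query whose context disables finalization, so the thetaSketch
// aggregator result is returned as a serialized sketch, not a number.
class UnfinalizedSketchQuery {

    static String buildQuery(String dataSource, String sketchColumn, String interval) {
        return "{"
            + "\"queryType\":\"timeseries\","
            + "\"dataSource\":\"" + dataSource + "\","
            + "\"granularity\":\"all\","
            + "\"intervals\":[\"" + interval + "\"],"
            + "\"aggregations\":[{"
            +     "\"type\":\"thetaSketch\","
            +     "\"name\":\"" + sketchColumn + "\","
            +     "\"fieldName\":\"" + sketchColumn + "\"}],"
            // The query context is where "finalize": false goes.
            + "\"context\":{\"finalize\":false}"
            + "}";
    }

    public static void main(String[] args) {
        // Placeholder names; substitute your own datasource, column, and interval.
        System.out.println(buildQuery("my_datasource", "user_id_sketch", "2016-01-01/2016-02-01"));
    }
}
```

String concatenation keeps the example free of dependencies; in real code a JSON library would be a safer way to build the body.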

Perfect! Thank you!

Arjun

I am trying to deserialize the base64 encoded string now. I see that there is a druid package that has some code for doing this:

https://search.maven.org/#search%7Cga%7C1%7Ca%3A%22druid-datasketches%22

This has the following method in io.druid.query.aggregation.datasketches.theta.SketchOperations

static com.yahoo.sketches.theta.Sketch deserializeFromBase64EncodedString(String str)

Exception in thread "main" com.yahoo.sketches.SketchesArgumentException: Unknown Serialization Version: 0
    at com.yahoo.sketches.theta.Sketch.heapify(Sketch.java:283)
    at com.yahoo.sketches.theta.Sketch.heapify(Sketch.java:257)
    at com.yahoo.sketches.theta.Sketches.heapifySketch(Sketches.java:44)
    at Test.Test.deserializeFromMemory(Test.java:32)
    at Test.Test.deserializeFromByteArray(Test.java:26)
    at Test.Test.deserializeFromBase64EncodedString(Test.java:17)
    at Test.Test.main(Test.java:12)

I replaced com.yahoo.sketches.memory.Memory with com.yahoo.memory.Memory, but then I get the following exception:

Does that work if you use the same DataSketches version that Druid ships with?

Good point, but I used the jar files provided as part of imply-1.3.0 and I still get that error. Does this deserialization also work for fields that have been indexed using hyperUnique in Druid, or only for theta sketches?

Here are the jar version details:

export CLASSPATH="/home/nsadmin/imply-1.3.0/dist/druid/extensions/druid-datasketches/druid-datasketches-0.9.1.1.jar:/home/nsadmin/imply-1.3.0/dist/druid/extensions/druid-datasketches/sketches-core-0.2.2.jar:/home/nsadmin/Test:/home/nsadmin/guava-20.0.jar:/home/nsadmin/commons-codec-1.10.jar:/home/nsadmin/imply-1.3.0/dist/tranquility/lib/com.metamx.java-util-0.27.9.jar:/home/nsadmin/imply-1.3.0/dist/druid/extensions/druid-kafka-indexing-service/slf4j-api-1.7.6.jar:/home/nsadmin/Test:."

nsadmin@mpdruid01:~$ javac Test/Test.java

nsadmin@mpdruid01:~$ java Test.Test

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Exception in thread "main" java.lang.IllegalArgumentException: Unknown Serialization Version: 0
    at com.yahoo.sketches.theta.Sketch.heapify(Sketch.java:246)
    at com.yahoo.sketches.theta.Sketch.heapify(Sketch.java:220)
    at com.yahoo.sketches.theta.Sketches.heapifySketch(Sketches.java:37)
    at io.druid.query.aggregation.datasketches.theta.SketchOperations.deserializeFromMemory(SketchOperations.java:82)
    at io.druid.query.aggregation.datasketches.theta.SketchOperations.deserializeFromByteArray(SketchOperations.java:76)
    at io.druid.query.aggregation.datasketches.theta.SketchOperations.deserializeFromBase64EncodedString(SketchOperations.java:67)
    at Test.Test.main(Test.java:8)

Here is the source I am using:

package Test;

import io.druid.query.aggregation.datasketches.theta.SketchOperations;
import com.yahoo.sketches.theta.Sketch;

public class Test {
    public static void main(String[] args) {
        // Base64 string copied from the Druid query response
        Sketch sketch = SketchOperations.deserializeFromBase64EncodedString("AQAADwAAAAAKAgAcEAAmBABcAQBnEAB4EACaEADlUAHIEALOEAMaEAM/BQNXEANvQANyEA==");
    }
}

Ah, it’s definitely only going to work for theta sketches. HyperUnique uses a different algorithm and serialization format that is not part of DataSketches. Check out HyperUniquesSerde in the Druid code base to see how that works.
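One way to see the mismatch is to decode the base64 string from the test program and look at its first bytes using only the JDK. This is a rough sketch, and it rests on an assumption about the DataSketches theta serialization layout that is not stated in this thread: that byte 1 of a serialized theta sketch holds the serial version, which is the value heapify complains about. The class and method names are made up for the example.

```java
import java.util.Base64;

// Hypothetical diagnostic (not part of Druid or DataSketches): decodes a
// base64 value from a query response and prints its leading header bytes.
class SketchHeaderPeek {

    static byte[] decode(String base64) {
        return Base64.getDecoder().decode(base64);
    }

    // Assumption: in a serialized theta sketch, byte 1 is the serial version.
    static int serialVersionByte(byte[] image) {
        return image[1] & 0xFF;
    }

    public static void main(String[] args) {
        byte[] image = decode(
            "AQAADwAAAAAKAgAcEAAmBABcAQBnEAB4EACaEADlUAHIEALOEAMaEAM/BQNXEANvQANyEA==");
        System.out.println("length = " + image.length);
        System.out.println("byte 0 = " + (image[0] & 0xFF));
        System.out.println("byte 1 (serial version, under the assumed theta layout) = "
            + serialVersionByte(image));
    }
}
```

For the string above, byte 1 decodes to 0, which lines up with the "Unknown Serialization Version: 0" message and is at least consistent with the value coming from a hyperUnique column rather than a thetaSketch one.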

I see. I took a look at HyperUniquesSerde, but it doesn't look like I can use it for my use case with fields indexed as hyperUnique in Druid, right? I want to take the hyperUnique fields from the query response and deserialize them so that I can add more values. How can I do that with hyperUnique in Druid? If that isn't possible, I will just revert to using thetaSketch indexes.

Arjun

You should be able to do that, but you’ll need to use Druid classes to interpret our variant of HLL. When you have a HyperLogLogCollector (see HyperLogLogCollector.makeCollector) then you can call “add” on it to add more values.