DataSketches Internal Working

Hi All,

I have read the Yahoo documentation, which describes the internal workings of the Theta Sketch algorithm.

But I am not able to correlate it with how Druid implements it. Basically, how does Druid create a theta sketch at ingestion time?

What I am trying to understand is how filtering on one dimension and a theta sketch on a different dimension work within a single aggregator in a Druid query.

Does Druid filter the rows based on the filters given in a filtered aggregator and then apply the theta sketch specified in the same filtered aggregator to them? (But how? The theta sketch set in Druid would already have been created at data ingestion time.)

Could anyone please answer this?


Pravesh Gupta


Dimensions that you need to filter on must be ingested as dimension columns. Druid then stores separate rows for different dimension values, and each row has its own thetaSketch. At query time, rows that do not match the filters are skipped, and the thetaSketches of the remaining rows are merged as required.
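To make the idea concrete, here is a rough Python sketch of the flow described above. It is not Druid's or DataSketches' actual code: ToySketch is a simplified "keep the k smallest hashes" structure, and the names hash01, ingest, query, the country dimension and the user ids are all made up for illustration. It only aims to show that each rolled-up row carries its own small sketch, and that a filter picks rows first and the surviving sketches are merged afterwards.

```python
import hashlib
from collections import defaultdict


def hash01(value):
    """Hash a value to a roughly uniform float between 0 and 1."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class ToySketch:
    """Toy stand-in for a theta sketch: keeps the k smallest hashes seen."""

    def __init__(self, k=1024):
        self.k = k
        self.hashes = set()

    def _trim(self):
        # Retain only the k smallest hash values.
        if len(self.hashes) > self.k:
            self.hashes = set(sorted(self.hashes)[: self.k])

    def update(self, value):
        self.hashes.add(hash01(value))
        self._trim()

    def union(self, other):
        # Merging two sketches: pool the hashes, keep the k smallest again.
        merged = ToySketch(min(self.k, other.k))
        merged.hashes = self.hashes | other.hashes
        merged._trim()
        return merged

    def estimate(self):
        if len(self.hashes) < self.k:
            return float(len(self.hashes))  # under capacity: exact count
        # At capacity: classic k-minimum-values estimator.
        return (self.k - 1) / max(self.hashes)


def ingest(events, k=1024):
    """'Ingestion': roll up events into one row (one sketch) per country."""
    rows = defaultdict(lambda: ToySketch(k))
    for country, user_id in events:
        rows[country].update(user_id)
    return dict(rows)


def query(rows, allowed_countries, k=1024):
    """'Query': skip rows that fail the filter, merge the remaining sketches."""
    result = ToySketch(k)
    for country, sketch in rows.items():
        if country in allowed_countries:
            result = result.union(sketch)
    return result.estimate()


# Hypothetical data: some user ids appear under more than one country.
events = (
    [("US", f"user{i}") for i in range(0, 50)]
    + [("IN", f"user{i}") for i in range(30, 100)]
    + [("UK", f"user{i}") for i in range(90, 120)]
)
rows = ingest(events)

print(query(rows, {"US"}))
print(query(rows, {"US", "IN"}))
print(query(rows, {"US", "IN", "UK"}))
```

Because the merge works on hash values, users that appear in more than one row are counted once. In this toy, a row is just one dimension value; in Druid the rollup also depends on the timestamp granularity and on the full combination of dimension values, which the example leaves out.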



Hey Himanshu,
Thanks for the response, and sorry for the late reply.

What do you mean by “each row will have its own thetaSketch”? Isn't a theta sketch a separate set of size K (given at ingestion time) that contains numbers between 0 and 1 (based on a Gaussian distribution)?


Pravesh Gupta

Also, to add to the above question:
What exactly does Druid do at query time if we have specified the theta sketch at ingestion time?