Druid ThetaSketch Related Query


I have a question about the behavior of Druid thetaSketch.

My file contains 594081 unique records. When I ingest the file into Druid with the default thetaSketch size=16384, I get a unique user count of 589149, so the error is about 0.8%.

Next, I increased the thetaSketch size to 65536 (4 times the default) and got a unique user count of 585288 (I also used size=65536 at query time), which is lower than the earlier count (589149 with the default thetaSketch size).

So here is my question: according to the Druid docs at http://druid.io/docs/latest/development/extensions-core/datasketches-aggregators , if I need higher accuracy I have to increase the size (a power of 2). I did that, but instead of getting higher accuracy the error percent increased. Is that possible?

Hi Ashutosh!
Theta sketch is a stochastic algorithm, and the specifications of the error bounds are statistical in nature. To evaluate accuracy one would need to run thousands of trials and analyze the distribution of the error.
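To illustrate what "running many trials" looks like, here is a minimal sketch in Python. It assumes a simplified KMV-style estimator (keep the k smallest hash values and estimate the count from the k-th smallest), which is not the DataSketches Theta implementation that Druid uses. The names kmv_estimate and simulate are hypothetical, and n, k and the number of trials are kept small so that it runs quickly.

```python
import heapq
import random
import statistics


def kmv_estimate(n, k, rng):
    """Estimate n from the k smallest of n uniform(0, 1) 'hash' values.

    Each random value stands in for the hash of one distinct item.
    """
    hashes = (rng.random() for _ in range(n))
    kth_smallest = heapq.nsmallest(k, hashes)[-1]
    return (k - 1) / kth_smallest


def simulate(n=10000, k=256, trials=200, seed=1):
    """Run many independent trials; return mean and std dev of the relative error."""
    rng = random.Random(seed)
    errors = [kmv_estimate(n, k, rng) / n - 1.0 for _ in range(trials)]
    return statistics.mean(errors), statistics.stdev(errors)


mean_err, sd_err = simulate()
print(f"mean relative error {mean_err:+.4f}, "
      f"std dev {sd_err:.4f}, 1/sqrt(k) = {256 ** -0.5:.4f}")
```

Under these assumptions the mean error should sit near zero and the spread should be close to 1/sqrt(k). Any single trial can land well away from the mean, which is why one ingestion run cannot confirm or refute the accuracy spec.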

Regarding this particular trial of yours, the error from the 64K sketch was 1.48%, which is outside of the 99% confidence interval, but not impossible. We expect the error to be within ±1.17% 99% of the time, and within ±0.78% 95% of the time.
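The arithmetic behind these numbers can be checked with a few lines of Python. This is a rough sketch that assumes the relative standard error is approximately 1/sqrt(k) and that the 0.78% and 1.17% figures correspond to about 2 and 3 standard deviations for k=65536. The helper names relative_error and approx_rse are made up for illustration.

```python
import math

TRUE_COUNT = 594081  # exact number of unique records in the file


def relative_error(estimate, truth=TRUE_COUNT):
    """Absolute relative error of a sketch estimate against the exact count."""
    return abs(estimate - truth) / truth


def approx_rse(k):
    """Assumed approximation of the relative standard error: 1 / sqrt(k)."""
    return 1.0 / math.sqrt(k)


# (sketch size, estimate reported by Druid) from the two runs in the question
for k, estimate in [(16384, 589149), (65536, 585288)]:
    err = relative_error(estimate)
    rse = approx_rse(k)
    print(f"k={k}: observed error {err:.2%}, "
          f"about {err / rse:.1f} standard errors "
          f"(1 sd = {rse:.2%}, 2 sd = {2 * rse:.2%}, 3 sd = {3 * rse:.2%})")
```

With this approximation the default-size run comes out at roughly 1.1 standard errors and the 64K run at roughly 3.8, so the second result is unusual but, as noted, not impossible.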

Here is the accuracy table: