DataSketches - new approximate sketches for Druid

Hey guys, if you’ve been paying attention to the 0.8.3 release notes (https://github.com/druid-io/druid/issues/2044), we recently added a new module to Druid called DataSketches. It provides a new set of aggregators built around an algorithm called the theta sketch. Theta sketches can be used to approximate the number of unique elements in a set, much like the “hyperUnique” aggregator (which is based on HyperLogLog). One big difference between the two algorithms is that hyperUnique only supports set union operations, whereas theta sketches support union, intersection, A NOT B, and other set operations. At a high level, this means theta sketches can be used for much more complex set analysis, and for use cases such as retention analysis. Theta sketches require more space than hyperUnique aggregators, so if you only need count(distinct)-style operations, hyperUnique is still the way to go. To learn more about theta sketches, check out the blog post:

As the DataSketches library grows and gains more approximate algorithms, we’ll be adding them to Druid. To learn more about DataSketches, check out the website:

http://datasketches.github.io/

One more interesting piece of information about theta sketches: you get 100% accurate results whenever the number of entries is less than the number of buckets in the sketch. This means exact counts at low cardinalities and reasonable estimates at higher cardinalities.
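The bucket count is controlled by the aggregator’s “size” parameter. As a minimal sketch of what that tuning looks like (the metric and column names here are assumed for illustration; “size” must be a power of 2 and defaults to 16384):

```json
{
  "type": "thetaSketch",
  "name": "unique_users",
  "fieldName": "user_id",
  "size": 16384
}
```

A larger size keeps results exact up to a higher cardinality and improves accuracy beyond it, at the cost of more memory per sketch.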

Very glad to hear that! Recently I have been struggling to implement a “feature revisit report” based on Druid. It looks like Theta Sketches are the “silver bullet” I have been looking for!
I will try it soon~~

On Sunday, December 27, 2015 at 12:43:05 AM UTC+8, Fangjin Yang wrote:

Hello,

Cool stuff! 2 questions:

  1. In the documentation example query on thetaSketch (link), where the goal is to find users who used both products A and B: why couldn’t one simply use an “AND” operator in the query filter followed by a single thetaSketch aggregation, instead of the shown approach of an “OR” filter followed by 2 filtered aggregations and a post-aggregation? Is it just to show the possibilities of the thetaSketch, or is the shown way the only correct one?
  2. Am I correct in assuming that, unlike the thetaSketch aggregator, the hyperUnique aggregator cannot be “tuned” for precision by the user?

Thank you,

/David

Hey David,

  1. In that example the “product” dimension is single-valued, so a user that visited both A and B would have two rows: one for A and one for B. In that case the “AND” requires doing an intersection of the sketch for A and the sketch for B (see the example query after this list). If “product” were multi-valued, then you could do an “AND” at the query level.

  2. That’s right, our implementation of HLL does not have any user-accessible tunings.
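For reference, here is a minimal sketch of the kind of query being discussed: an “OR” filter, two filtered thetaSketch aggregations, and an INTERSECT post-aggregation. The dataSource, dimension, and column names below are assumptions modeled on the documentation example, not copied from this thread:

```json
{
  "queryType": "groupBy",
  "dataSource": "test_datasource",
  "granularity": "all",
  "dimensions": [],
  "filter": {
    "type": "or",
    "fields": [
      { "type": "selector", "dimension": "product", "value": "A" },
      { "type": "selector", "dimension": "product", "value": "B" }
    ]
  },
  "aggregations": [
    {
      "type": "filtered",
      "filter": { "type": "selector", "dimension": "product", "value": "A" },
      "aggregator": { "type": "thetaSketch", "name": "A_unique_users", "fieldName": "user_id" }
    },
    {
      "type": "filtered",
      "filter": { "type": "selector", "dimension": "product", "value": "B" },
      "aggregator": { "type": "thetaSketch", "name": "B_unique_users", "fieldName": "user_id" }
    }
  ],
  "postAggregations": [
    {
      "type": "thetaSketchEstimate",
      "name": "users_with_A_and_B",
      "field": {
        "type": "thetaSketchSetOp",
        "name": "A_intersect_B",
        "func": "INTERSECT",
        "fields": [
          { "type": "fieldAccess", "fieldName": "A_unique_users" },
          { "type": "fieldAccess", "fieldName": "B_unique_users" }
        ]
      }
    }
  ],
  "intervals": [ "2015-12-01/2016-01-01" ]
}
```

The intersection happens in the thetaSketchSetOp post-aggregator, which is why each product needs its own filtered aggregation when the dimension is single-valued.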

Hello Gian,

Thank you for the explanation - makes perfect sense!

We are currently using HyperLogLog unique counts over large cardinalities, and on our cluster we observe that simple Druid queries, like a topN without filter on a single metric or a timeseries on a single metric, get extremely slow if we query more than just a few days of data. Currently, the query latency for HyperLogLogs is an order of magnitude worse than for queries on normal metrics. Querying 4 weeks’ worth of data with query and segment granularity set to HOUR, 3 million records per partition, roughly 10 partitions per hour, roughly 3.5 GB of segment volume per hour, and a rollup ratio of 20, a single topN HyperLogLog query keeps 15 historical nodes busy for several minutes, whereas the same topN query on a normal metric takes between 5 and 25 seconds.
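For concreteness, the kind of query described looks roughly like this (the dataSource, dimension, and column names are assumed for illustration):

```json
{
  "queryType": "topN",
  "dataSource": "events",
  "granularity": "all",
  "dimension": "country",
  "metric": "unique_users",
  "threshold": 10,
  "aggregations": [
    { "type": "hyperUnique", "name": "unique_users", "fieldName": "user_id" }
  ],
  "intervals": [ "2016-01-01/2016-01-29" ]
}
```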

So obviously we need to look into this matter. My questions would be:

  • is it normal that HLLs take that long?
  • are theta sketches much faster than HyperLogLogs on large time ranges and large cardinalities? I searched for comparisons of the Druid performance of sketches vs. HyperLogLogs and couldn’t find anything.
  • we need to build a cluster for ad-hoc analysis, and all queries should return in less than a couple of seconds. Is using HyperLogLogs for unique counts the way to go at all? I tried to read up on how to tune our HLL performance, but didn’t find much that would give me any hope of getting the minute-long latencies down to seconds, and the current tests are just on 4 weeks of data, whereas we will eventually need months of data.

thanks

Sascha

oh, one more:

on the DataSketches website, I read about different forms of theta sketches: hash-table vs. array-based, and alpha vs. quick-select updates.

Which of the setups is Druid using?

  • are theta sketches much faster than HyperLogLogs on large time ranges and large cardinalities? I searched for comparisons of the Druid performance of sketches vs. HyperLogLogs and couldn’t find anything.
    I haven’t seen a comparison, but we mostly use thetaSketch because it allows doing intersections and lets you make a speed/accuracy tradeoff. Sketch merge operations are in general very CPU intensive, and queries with sketch-based aggregators generally run slower than queries on other primitive-type metrics. To tune things you would have to look at the metrics, see where time is actually being spent, and go from there; see the example metric events sketched at the end of this post. Please see the metrics emitted by Druid: http://druid.io/docs/0.9.0/operations/metrics.html . Check your historicals and brokers for heavy GC activity and/or swapping.

  • we need to build a cluster for ad-hoc analysis, and all queries should return in less than a couple of seconds. Is using HyperLogLogs for unique counts the way to go at all?
    Unfortunately, computing things like “unique counts” is computationally hard and expensive no matter what. See the note above for improving performance, but it might not get down to seconds for your months-of-data use case.

  • Which of the setups is Druid using?
    The Druid thetaSketch module uses the QuickSelect sketch.
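As promised above, here is what the emitted metric events look like when you follow query time through the cluster; the host, dataSource, and values below are made up for illustration, but the metric names come from the docs linked above:

```json
{"feed": "metrics", "service": "druid/historical", "host": "historical01:8083",
 "timestamp": "2016-08-01T00:00:00.000Z", "metric": "query/segment/time",
 "value": 1542, "dataSource": "events", "id": "some-query-id"}
```

Comparing query/time on the broker against query/segment/time and query/wait/time on the historicals tells you whether time goes to scanning segments, waiting for processing threads, or merging results.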

Thanks a lot Himanshu!

The Druid thetaSketch module uses the QuickSelect sketch.
I read that there is also a choice between sorted and unsorted representations. We’d be interested in the pre-sorted representation because we don’t care how much time is spent ingesting the data, mainly how quickly it can be processed at query time. I imagine that the pre-sorted, array-based representation is a heap realized as a fixed-size array, using array-index arithmetic to arrange a fully balanced binary tree, and that merging these k-min heaps is also fast, because zipping up pre-sorted heaps, as in the last step of a merge sort, is efficient. So I figure that what takes up most of the query time with theta sketches is the sheer volume of data that needs to be moved.

Hi Guys,

We are facing an issue while doing retention analysis via DataSketches. We are following this link.

Error we are getting here :

“error” : “Could not resolve type id ‘interval’ into a subtype of [simple type, class io.druid.query.filter.DimFilter]\n at [Source: HttpInputOverHTTP@433cf234; line: 24, column: 13] (through reference chain: java.util.ArrayList[0]->java.util.ArrayList[1])”

Normal DataSketches queries are working fine for us; we are only getting this error for retention analysis.

Do we need to add any extension here?

Hey Lovenish,

That link refers to code that is not yet in a Druid release. It is in master and will be in 0.9.2, but if you want to use it now, you’d need to build your own version of Druid from the current master branch. You can do this with “mvn package”.
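For context, the error is about the “interval” dim filter, which the retention example relies on and which only exists in master right now. A minimal sketch of what that filter looks like there (the interval values are assumed for illustration):

```json
{
  "type": "interval",
  "dimension": "__time",
  "intervals": [ "2016-07-01T00:00:00.000Z/2016-07-08T00:00:00.000Z" ]
}
```

On a release that doesn’t know this filter type, Jackson fails to resolve the type id, which is exactly the error shown above.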

Thanks Gian!

Hi FJY, is there a thetaSketch benchmark on Druid somewhere? I found that thetaSketch queries are too slow.

On Friday, December 18, 2015 at 5:35:14 AM UTC+8, Fangjin Yang wrote:

Hi Skyler,

Unfortunately there is no benchmark, but we do have queries ranging from some that run in seconds to some that take minutes. From what I remember, in most performance issues we noticed high GC overhead and tweaked JVM memory settings as much as we could.

In general, I would watch the metrics and try to understand whether most time is spent scanning the segments at the historicals, or waiting and merging at the brokers. I understand this might be hard to do without a proper monitoring setup. We are trying to provide a simpler way to investigate query slowness with https://github.com/druid-io/druid/issues/3324 .

– Himanshu