Filtering with sketch results

Is it possible to filter using sketch results?

I want to know the metrics (uv, pv, events) of users who joined last week.

select count(distinct userid) as uv, sum(pv) as pv, sum(events) as events
from weblog
where userid in (
  select userid
  from weblog
  where join_date between '2018-11-01' and '2018-11-07'
)
and dt between '2018-11-08' and '2018-11-14'

Hey Jay,

If all you wanted was the count(distinct userid), then count-distinct sketches (either HLL or Theta) could do this. But filtering rows by the sketch's contents, which is what sum(pv) and sum(events) need, is a different matter.

There has been some recent work on doing this sort of join approximately using bloom filters, which might be interesting. This PR is part of it: https://github.com/apache/incubator-druid/pull/6502. The idea is that you build a bloom filter from the subquery's results and then use that bloom filter as a filter for the outer query. You can control the error rate through the sizing of the bloom filter. This work isn’t a complete solution yet, but there is enough in master to play with (assuming you build the bloom filter externally to Druid). If you are interested in that, you might want to hop over to the dev list (dev@druid.apache.org) to talk further.
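To make the idea concrete, here is a rough standalone sketch in plain Python. It is not Druid code and does not use the API from the PR: the BloomFilter class, the sample rows, and all the names in it are made up for illustration. It only shows the shape of the approach: build the filter from the subquery's userids, then keep the outer rows whose userid passes the filter.

```python
import hashlib
import math


class BloomFilter:
    """Toy Bloom filter for illustration; not Druid's implementation."""

    def __init__(self, expected_items, error_rate):
        # Standard sizing formulas: number of bits m and number of hashes k.
        self.num_bits = max(1, math.ceil(
            -expected_items * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(
            self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all bit positions from two 64-bit values.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Hypothetical data standing in for the two halves of the query.
joined_last_week = [f"user{i}" for i in range(1000)]  # subquery result
this_week_rows = [(f"user{i}", 2, 5) for i in range(500, 3000)]  # (userid, pv, events)

# Step 1: build the filter from the subquery's userids.
bloom = BloomFilter(expected_items=len(joined_last_week), error_rate=0.01)
for userid in joined_last_week:
    bloom.add(userid)

# Step 2: use it as the filter for the outer query.
matched = [row for row in this_week_rows if row[0] in bloom]
approx_uv = len({userid for userid, _, _ in matched})
approx_pv = sum(pv for _, pv, _ in matched)
approx_events = sum(events for _, _, events in matched)
```

A bloom filter never drops a userid that was added to it, so the results can only overcount: a small fraction of users who did not join last week (roughly the configured error rate) may slip through and inflate uv, pv, and events.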

Hi Gian,

“userid” is a high-cardinality field, so it cannot be used as a dimension.

A different approach may be needed to resolve this issue.

Hi, I think you might be able to get an answer through Theta Sketches, using sketch set operations between the two groups.

The Theta Sketch documentation for Druid:

http://druid.io/docs/latest/development/extensions-core/datasketches-theta.html
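For what it's worth, here is a toy illustration in plain Python of what a set operation between the two groups looks like. This is not the Druid extension or the DataSketches library; ToyThetaSketch, hash64, and the sample userids are invented for the example, and a real sketch is built incrementally rather than by sorting all hashes. One sketch covers the users who joined last week, the other covers the users active this week, and the estimate of their intersection stands in for uv. As far as I can tell, this only covers the distinct count; it does not give sum(pv) or sum(events).

```python
import hashlib

HASH_SPACE = 2 ** 64  # hashes are integers in [0, HASH_SPACE)


def hash64(item):
    """Map an item to a 64-bit integer hash."""
    digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class ToyThetaSketch:
    """Keeps the hashes below a threshold theta; illustration only."""

    def __init__(self, theta, retained):
        self.theta = theta        # sampling threshold in (0, HASH_SPACE]
        self.retained = retained  # set of hashes strictly below theta

    @classmethod
    def from_items(cls, items, k=1024):
        hashes = sorted({hash64(item) for item in items})
        if len(hashes) <= k:
            # Few distinct items: keep everything, so the count is exact.
            return cls(HASH_SPACE, set(hashes))
        # Otherwise keep the k smallest hashes; theta is the next one up.
        return cls(hashes[k], set(hashes[:k]))

    def intersect(self, other):
        theta = min(self.theta, other.theta)
        common = {h for h in self.retained & other.retained if h < theta}
        return ToyThetaSketch(theta, common)

    def estimate(self):
        # Retained count scaled up by the sampled fraction of hash space.
        return len(self.retained) * HASH_SPACE / self.theta


# Small groups fit entirely in the sketches, so the result is exact.
joined = ToyThetaSketch.from_items(f"user{i}" for i in range(0, 100))
active = ToyThetaSketch.from_items(f"user{i}" for i in range(50, 200))
small_uv = joined.intersect(active).estimate()

# Larger groups are sampled, so the result is an estimate (true overlap: 10000).
big_joined = ToyThetaSketch.from_items(f"user{i}" for i in range(0, 20000))
big_active = ToyThetaSketch.from_items(f"user{i}" for i in range(10000, 30000))
big_uv = big_joined.intersect(big_active).estimate()
```

The accuracy of the intersection depends on k and on how large the overlap is relative to the two groups, so a small overlap between two very large groups will be estimated less reliably.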