How to query number of users who have visited exactly N unique Pages

I have input data of the format

Timestamp,User Id,Page Id

where each input tuple represent a page visit by a user.

Given the cardinality of “UserId” and “PageId” could be huge what is the effective way of querying this data if i want to get result of the following query :

“How many users have visited exactly 3 unique Pages?”.

For eg: if input it this :

2016-02-09, U1,P1

2016-02-09, U2,P1

2016-02-09, U1,P2

2016-02-09, U1,P3

2016-02-09, U2,P2

2016-02-09, U2,P1

In this case answer should be 1 user(i.e U1) has visited exactly 3 unique pages(P1,P2,P3)

Is there a way to do such analysis in druid?

Any pointers here?
Is such a query possible with druid even if it requires addition of a new query type?

I would be happy to contribute in case this requires some feature addition in Druid.

Thanks

Rohit

Have you thought about using sketches ?

I thought but couldn’t formulate this problem in terms of sketches.

Any help would be of great help.

–Rohit

i am not sure if this can really fit your use case

but assume you build a sketch of unique user per page.

So now if the user ask want to know the number of unique user whom visited p_1 and p_3 and p_x what you have to do is to compute the inter section of sketch_p1 and sketch_p3 and sketch_px.

i know it is abstract idea but i bet it will work.

please let me know if you get it working or if you have other way to do it :smiley:

Slim,

I was thinking on similar lines but there are two problems with this approach

  1. I need to be aware of all the pages upfront,so that i can define them as sketch aggregations in the ingestion flow.

  2. To find number of users who have visited exactly 2 products we need to take intersections of all pairs of such sketches,which could be a problem given number of pages are of order of 20K.

I couldn’t find solution to any of these 2 problems hence couldn’t proceed.

Thanks

Rohit

I’d suggest a slightly different query. Use the groupBy query to run a query grouped by userId and using the sketch aggegator on pageid. Then apply a having filter on the output of the sketch aggregator. Given that sketches are approximate and depending on the sketch size you use, you might need to write the having in terms of “expectedCount - epsilon <= sketchSize <= expectedCount + epsilon” instead of sketchsize = expectedCount. This will return rows for all the users who have visited expectedCount pages. If you want a count of the users, use groupBy’s nesting feature and run the sketch on page in the inner query and use the sketchAggregator (or maybe its a cardinality aggregator) against the user dimension to get the count of users.

The one thing you’ll need to worry about with all of this is the size of the user dimension. If the cardinality of the user dimension is too high, you may exceed groupBy’s maximum intermediate rows or maximum result rows for the inner query. It could also potentially require a huge amount of memory on your broker node.

Will

Will Lauer
Tech Yahoo, Software Dev Eng, Principal
P: 217.255.4262 M: 508.561.6427
1908 S First St. Champaign IL 61820