Re: [druid-user] Querying events by the occurrence of another event

Assuming you have the following structure

__time,userid,event_type

The following query should do it

select THETA_SKETCH_ESTIMATE(THETA_SKETCH_INTERSECT(theta_blog,theta_order)) from (select ds_theta(userid) filter (where event_type=‘blogviewed’) theta_blog,select ds_theta(userid) filter (where event_type=‘order_placed’) theta_order from source)

you can make the filter more complex by adding a blog name etc. The above will give intersect for the full data source. You could slice it by time extract for week etc.

vijay

Hi Vijay,

Thanks for the reply.

I tried the suggested query, but it counts distinct users matching both filters, meaning how many users caused both events.

We are looking to answer how many of the “blogViewed” events are associated with those users. Or, in other words, we’re trying to count events and not users.

Is there a way to do it?

the tuple sketch should do it. The below query is on the wikipedia dataset and does two things

  1. distinctcount of distinct users by country who contributed to en.wikipedia and es.wikipedia.
  2. mean of the “count” metric by each country (mean = total count/distinct user count)

so if you multiply intersect_mean*intersect_est in the below query you will get count of events of users contributing to both wikipedia channels
This is available only in the native query

{
“dataSource”: “wikipedia1”,
“queryType”: “groupBy”,
“intervals”: [
“2016-06-27T00:00:00.000Z/2016-06-27T23:00:00.000Z”
],
“granularity”: “all”,

“aggregations”: [
{ “type”: “filtered”,
“aggregator”:
{
“type”: “arrayOfDoublesSketch”,
“name”: “sketch_en”,
“fieldName”: “user”,
“metricColumns” : [“count”],
“nominalEntries”: 65536
},
“filter”: {
“type”: “selector”,
“dimension”: “channel”,
“value”: “#en.wikipedia”,
“extractionFn”: null
}
},
{ “type”: “filtered”,
“aggregator”:
{
“type”: “arrayOfDoublesSketch”,
“name”: “sketch_es”,
“fieldName”: “user”,
“metricColumns” : [“count”],
“nominalEntries”: 65536
},
“filter”: {
“type”: “selector”,
“dimension”: “channel”,
“value”: “#es.wikipedia”,
“extractionFn”: null
}
}
]

Hi Vijay,

That looks okay for roughly estimating things like contributors but seems too rough to be useful for counting events like orders. Isn’t there any way to count with better accuracy (at least the same as offered by HLL or Theta)?

Bests,
Marcos

not as far as I know. The basic issue with your requirement is that you need to intersect the users and count the events the users participated in. Essentially you need to estimate metrics along with count. The tuple sketch is the only one that can do this (even here you get count of all events the users participated in, not events of one type alone.

vijay

the tuple sketch on the wikipedia data set gives the result of below.

select count(*) from (select “user” from wikipedia1 where user in (select distinct “user” from wikipedia1 where channel=’#en.wikipedia’)
and channel=’#es.wikipedia’)

one approach you could take is create a an event lookup…this will keep only the latest user, time of blog_viewed_event.

Then a query like
select count(user) from events where lkp(user,‘event_lookup’)!=’’ and type=‘order_placed’

you will need a jdbc or a kafka lookup. I think this should work but not sure how large the lookup will get. However since the lookup keeps only one event for a user it may be small enough.

I think it is either the lookup or join. Can’t think of any other way.