Lookalike Query

We would like to use Druid for analyzing viewership and are having a hard time figuring our how to support lookalike analysis. For example, given the following dataset we’d like to see what networks someone that watched ESPN also watched:

{"timestamp": "2011-01-12T00:00:00.000Z", "user": 1, "network": "ESPN", "time_spent": 30, "demographics": ...}
{"timestamp": "2011-01-12T00:00:00.000Z", "user": 1, "network": "ABC", "time_spent": 5, "demographics": ...}
{"timestamp": "2011-01-12T00:00:00.000Z", "user": 1, "network": "HGTV", "time_spent": 15, "demographics": ...}
{"timestamp": "2011-01-12T00:00:00.000Z", "user": 3, "network": "ESPN", "time_spent": 30, "demographics": ... }
{"timestamp": "2011-01-12T00:00:00.000Z", "user": 2, "network": "ESPN", "time_spent": 30, "demographics": ...}
{"timestamp": "2011-01-12T00:00:00.000Z", "user": 1, "network": "ESPN", "time_spent": 13, "demographics": ...}


So for user 1, we’d want to get back ESPN, ABC and HGTV. If we use a filter, we’d only get back the rows where ESPN was in the network field which defeats the purpose. In our RDBMS today, we’re able to do this using a multi-pass SQL approach by populating a temporary table with all the users that watched ESPN and then joining that back on the viewership table to see everything they watched, but this list can get into the millions so it doesn’t feel like the right approach to take on Druid.

Any other ideas or options we haven’t considered? Is this possible or do we need to load the data into a different schema in order to support this use case?

Hi Tim,

For a modest number (~ hundreds of thousands) of users then I would suggest a Druid SQL query like:

SELECT DISTINCT “network” FROM tbl WHERE “user” IN (SELECT “user” FROM tbl WHERE “network” = ‘ESPN’)

Millions might work if you raise the druid.sql.planner.maxSemiJoinRowsInMemory config and possibly heap sizes as well. (The list of users ends up needing to get transferred to the Java heap on the Druid servers.)

But also, if you can deal with an approximation, you’re in luck: we have been working recently on enabling this kind of thing on large (millions +) sets of users through an approximate join using bloom filters. Check out http://druid.io/docs/latest/development/extensions-core/bloom-filter.html for some details on the (partial) work that has shipped recently. Future work includes tying it into Druid SQL for queries like the above, and making a bloom filter aggregator that lets you build bloom filters at query time. But there should be enough there in the current version to play around with a bit.