Collaborative Filtering and Recommendation Services with Druid

Curious to see if anyone has tackled something similar when using Druid.

To be prevent being to verbose I have the following flattened data spec being ingested by Druid.

“sessionID”, “skuId”

The sessionID is always defined as it is the Druid short live session ID. In every segment the sessionId is available.

In some cases I have a null skuId. I only track a skuId when a user visits a product detail page.

I am trying to determine if Druid is suited for the below task.

I would like to query into Druid when a user hits a product detail page. Lets say the SKU value is 12345. During the remainder of the session they go to the following product detail pages.

skuId 54321, skuId 32145 and skuId 55555.

For all of the data events the person has the same sessionID. The data is written out to the spec with the sessionID and the skuId.

So in this case I have committed the following data:

sessionID - 0::Aasdas412asdasd123asdasd (made up ID)

skuId [54321, 32145, 55555, 12345]

I have another user come to the site and performs similar actions. In some cases goes to the same skuID as the other user but in some cases does not.

sessionID - 0::asd234sdfssaghhddafcd (made up ID)

skuId [54321, 666666, 55555, 897545]

I would like the ability in real time to perform an aggregation to build a recommendation query to state:

“All users who viewed this sku (54321) have also viewed these other sku’s” Essentially returning a weight(count) all skuIds that were viewed in all sessions that contained the datapoint sku value of 54321.

My thought was to query all of the sessionIds that have the skuID 54321. After that iterate over all sessionIds and grab all other skus that were viewed in those sessions. Take the count of each sku and show in order the skus that were reviewed during the session.

I have been looking at the datasketches library but that still seems to be slightly off with what I am trying.

I have tried a query of queries in Druid SQL but dynamic values don’t seem to be possible. (filtering on returned sessionIds)

I have tried filtered aggregations but cannot figure out how to grab an array of sessionIds and loop over to build a list of all ids.

Any help would be greatly appreciated.

Hi Brian,

Have you looked at Bloom filters?

https://druid.apache.org/docs/latest/development/extensions-core/bloom-filter.html

Eric

Eric Graham

Solutions Engineer -** **Imply

**cell: **303-589-4581

email: eric.graham@imply.io

www.imply.io

Hi Eric.
I have not. Have you had success with using it for the use case mentioned?

Thanks

Brian

Brian,

Are the number of SKUs fixed? If that is the case then you and do an intersect of the unique session ids for each SKU.

vijay

Hi Vijay.

They are not fixed. At any given moment new ones could be introduced.

You could compute the sketch for each SKU when ingesting and the intersect in the query.

vijay

Is there a tutorial for that?

Brian,

Below query is an example of intersecting users for sku1 with sku2 and sku3 in turn. The aggregations compute the theta sketch for each sku and then the intersect happens in the post aggregation. I have assumed sku to be a single value dimension.

{

“queryType”: “groupBy”,

“dataSource”: {

“type”: “table”,

“name”: “source1”

},

“intervals”: {

“type”: “intervals”,

“intervals”: [

“2019-11-19/2019-11-20”

]

},

“virtualColumns”:,

“filter”:null,

“granularity”:{

“type”: “all”

},

“aggregations”: [{

“type”: “filtered”,

“filter”: {

“type”: “selector”,

“dimension”: “sku”,

“value”: “sku1”

},

“aggregator”: {

“type”: “thetaSketch”,

“name”: “sku1_users",

“fieldName”: “user”

}

},

{

“type”: “filtered”,

“filter”: {

“type”: “selector”,

“dimension”: “sku”,

“value”: “sku2”

},

“aggregator”: {

“type”: “thetaSketch”,

“name”: “ru_users”,

“fieldName”: “user”

}

},

{

“type”: “filtered”,

“filter”: {

“type”: “selector”,

“dimension”: “sku”,

“value”: “sku3”

},

“aggregator”: {

“type”: “thetaSketch”,

“name”: “sku3_users",

“fieldName”: “user”

}

}],

“postAggregations”: [{

“type”: “thetaSketchEstimate”,

“name”: “sku1_intersect_sku2_users",

“field”: {

“type”: “thetaSketchSetOp”,

“func”: “INTERSECT”,

“fields”: [

{

“type”: “fieldAccess”,

“fieldName”: “sku1_users"

},

{

“type”: “fieldAccess”,

“fieldName”: “sku2_users"

} ,

“type”: “thetaSketchEstimate”,

“name”: “sku1_intersect_sku3_users",

“field”: {

“type”: “thetaSketchSetOp”,

“func”: “INTERSECT”,

“fields”: [

{

“type”: “fieldAccess”,

“fieldName”: “sku1_users"

},

{

“type”: “fieldAccess”,

“fieldName”: “sku3_users"

}]

}

}

]

}

HI Vijay.

Thanks for the help this is great. I took this and modified it to run the intersection. It is great and the counts came back just fine. Here is the sample payload that came back. I did omit the 3rd sku for brevity.

[{“version”:“v1”,“timestamp”:“2020-01-28T00:00:00.000Z”,“event”:{"sku1_intersect_sku2_users
":1.0,“sku2_users”:1.0,“sku1_users”:7.0}}]

As you can see above it did indeed return the aggregated count of where two product skus intersect between users.

In this case I am looking for something slightly different. The UserID I am using is a short lived sessionID (divolte). Ideally I am trying to build a recommendation engine.

The above intersection is great for CTR and a number of other business items we are looking to solve. This last one is a major piece for us.

For clarification let me break down the scenario, and again I am not sure if Druid is capable without some type of preaggregation, application layer, or really an ETL on the incoming streams of data from our web agent (Divolte).

We are trying to solve this issue.

  1. A user visits our site and goes to a product detail page.

  2. Lets say the user visits sku ID 12345

  3. I would then like to look at all of the other sessions that have visited the same skuID 12345 ( so I would have a list of many sessionIDs some belonging to this user and others will not)

  4. I would like to scan all other sessionIds that have the above skuId of 12345 as a visited sku and see all other skuids that were visited during all of those sessionIds in the system.

So visually

Lets say I have the following sessions

sessionID : 0:3723723848237

skuId 12345

skuId 54321

skuId 99999

sessionID: 0:8723871237981

skuId 12345

skuId 232323

skuId 99999

sessionID: 0:2364664666666

skuId 12345

skuId 67432

skuId 88888

sessionID: 0:xxxxxxxxxx

skuId 998786

skuId 67432

skuId 88888

So when the user visits the product detail page for 12345 I would be able to run a filter against all other sessions that have skuId 12345 and get all of the other skuIds that were in those sessions.

So the above query would yield something like this. All sessionIds for all users that have skuId 12345 also have these other skuIds

Resulting in an array of :

skuID: [12345, 54321, 99999, 12345, 232323, 99999, 12345, 67432, 88888]

The last sessionID : 0:xxxxxxxxxx would omit all skuIDs and not be in the return since it did not have a skuId of 12345 in the array of collected skuids for that session.

Hope this makes sense. I would also use the count for weight. So 12345 would count for 3, 99999 would have a count of 2 and everything else would be a count of 1.

I am asking a lot here I believe :slight_smile:

Thanks again.

Brian

Is the skuid a multi-value dimension? If that is the case you should able to retrieve all rows where one of the skuid matches 12345. Collecting all the ids in a single array is not doable as far as I know.

Hi Vijay.

Nope they are collected individually.

So as a user moves from page to page in a session, and they visit a product detail page it would write a single skuid to an event. All events can queried and aggregated. If I visit 5 different product detail pages throughout my session I would have 5 different events in my datasets.

Aggregating that is easy if I have one sessionId. In theory I can query one sessionID and get all skuIds pretty easily using druid.

Where I am running into issues is the query of a query concept.

I am trying to query all sessionIds that have this similiar skuid

Then

Query all skuIds (multiple entries) associated with all of those sessions from the initial query.

There is a feature in Elastic Search that I am trying to match. And Elastic requires an index to be built. Then they have queries you can run similar to this below. Again, maybe this is not the right tool but with the speed of Druid and the real time data access it would be great if it could match the feature. I understand sometimes there are better tools for each job, but I would like to rule this out completely before selecting something else. Also, I am not opposed to building out a different data stream if need be. Here is the elastic example.

I am assuming I would need some sort of index to really make this work, or atleast have the skuids in an array by sessionID. I currently have the same sessionId n times with a single skuId for each entry.

PUT recs/user/1
{    "movies_liked": ["Forrest Gump", "Terminator", "Rambo", "Rocky", "Good Will Hunting"]}

PUT recs/user/2
{    "movies_liked": ["Forrest Gump", "Terminator", "Rocky IV", "Rocky", "Rocky II", "Predator"]}

PUT recs/user/3
{    "movies_liked": ["Forrest Gump", "The Thin Red Line", "Good Will Hunting", "Rocky II", "Predator", "Batman"]}

PUT recs/user/4
{    "movies_liked": ["Forrest Gump", "Something about Mary", "Sixteen Candles"]}

Then you would run this query

POST recs/user/_search
{
    "query": {
        "match": {
            "movies_liked": "Terminator"
        }
    },
    "aggregations": {
        "movies_like_terminator": {
            "terms": {
                "field": "movies_liked",
                "min_doc_count": 1
            }
        }
    }
}

And the end result would look like this

"movies_like_terminator": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
        {
            "key": "Forrest Gump",
            "doc_count": 2
        },
        {
            "key": "Rocky",
            "doc_count": 2
        },
        {
            "key": "Terminator",
            "doc_count": 2
        },
        {
            "key": "Good Will Hunting",
            "doc_count": 1
        }
}

query of a query can be done by specifying the data source as a query in the json and then nesting the queries. I will try to send you an example.

the query . select skuid from btest where Session in (select Session from btest where skuid=‘12345’)

gives me

12345

54321

99999

12345

232323

99999

12345

67432

88888

Hello.

I made a swing at this and it is working great.

I added a count statement for anyone else that may ever be in search of a similar answer. Thanks again for all of your help.

This is an amazing product!

SELECT “sku”,
COUNT(*) AS “count”
FROM “tablename”
WHERE “sessionId” in (SELECT
“sessionId”
FROM “tablename”
WHERE “sku”=‘12345’)
AND “sku” != ‘’
GROUP BY “sku”