Unique, new, return and repeat visitor counts

Hi - if we have really simple, standard website pageview events with these attributes:

  • url

  • userId

  • timestamp

Are there standard Druid best practices for ingest dimensions/aggregations and queries for calculating unique, new, return and repeat visitor counts, as defined in [1]?

For unique visitors, there seem to be two straightforward options:

  1. include userId as a dimension during indexing and then do a timeseries query with a cardinality aggregation on userId

  2. include a hyperUnique aggregation on userId during indexing and then do a timeseries query with a hyperUnique aggregation on userId
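To make the two options concrete, here is a sketch of what the two timeseries query bodies might look like, written as Python dicts in Druid's native JSON query shape. The datasource name ("pageviews"), interval, granularity, and the hyperUnique metric name ("userIdSketch") are illustrative assumptions, not anything from the thread:

```python
# Option 1: userId stored as a plain dimension; uniques are estimated at
# query time with a cardinality aggregator.
# Datasource, interval, granularity, and metric names are assumed.
cardinality_query = {
    "queryType": "timeseries",
    "dataSource": "pageviews",
    "granularity": "day",
    "intervals": ["2015-01-01/2015-02-01"],
    "aggregations": [
        {"type": "cardinality", "name": "uniqueVisitors", "fieldNames": ["userId"]}
    ],
}

# Option 2: a hyperUnique metric (assumed here to be named "userIdSketch")
# was built at indexing time; query it with a hyperUnique aggregator.
hyperunique_query = {
    "queryType": "timeseries",
    "dataSource": "pageviews",
    "granularity": "day",
    "intervals": ["2015-01-01/2015-02-01"],
    "aggregations": [
        {"type": "hyperUnique", "name": "uniqueVisitors", "fieldName": "userIdSketch"}
    ],
}
```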

Does that sound correct for unique visitors?

[1] defines new visitor as having their first-ever visit during the reporting period. Using just the three attributes url, userId and timestamp, is there a way to query Druid for new visitor counts? Return visitor count should be a similar, but opposite query.

[1] defines repeat visitor as having two or more visits during the reporting period. Using just the three attributes url, userId and timestamp, is there a way to query Druid for repeat visitor counts?

Thanks,

Zach

[1] http://www.digitalanalyticsassociation.org/Files/PDF_standards/WebAnalyticsDefinitions.pdf

Hi Zach,

We don’t have any standard best practices to compute those types of metrics in Druid.

I’m not familiar with the details of the definitions you are referring to, but from what you describe, some of it is certainly possible.

Your suggestion 2) should work to get the number of unique visitors in any given timeframe; however, it won’t tell you whether they are new, return or repeat visitors.

It might be possible to intersect hyperUnique sketches to get the number of return visitors from one period to the next, but if the set of return visitors is small compared to the overall number of unique visitors, the approximation will be very bad.
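To illustrate why the intersection approximation degrades, here is a toy sketch using exact Python sets in place of HLL sketches. The user IDs are made up. With sketches, the intersection must be estimated via inclusion-exclusion (|A ∩ B| = |A| + |B| - |A ∪ B|), and each term carries error proportional to its own cardinality, so a small true intersection gets swamped by the error on the large terms:

```python
# Exact sets stand in for hyperUnique sketches; user IDs are invented.
period_a = {"u1", "u2", "u3", "u4", "u5"}  # unique visitors in period A
period_b = {"u4", "u5", "u6", "u7"}        # unique visitors in period B

# Exact return-visitor set: present in both periods.
return_visitors = period_a & period_b

# With HLL you only have cardinality estimates, so the intersection is
# computed by inclusion-exclusion; each |.| term would carry HLL error.
estimate = len(period_a) + len(period_b) - len(period_a | period_b)
```

With exact sets the two agree (2 return visitors); with real sketches, each of the three estimated terms can be off by a few percent of its own size, which is huge relative to a small intersection.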

You can include a userId dimension to filter on individual users, but it could get pretty large for high volume sites.

If you do, you can still keep the hyperUnique metric to compute uniques faster than having to rely on cardinality aggregations.

Calculating repeat visitors for arbitrary reporting periods is trickier. One way would be to maintain a separate key-value store that counts visits per user, and use it to create a dimension in Druid that you can filter on at ingestion time. I can’t think of a quick way to do that directly in Druid.
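A minimal sketch of that key-value-store enrichment step, with a plain dict standing in for the external store; the field names (visitNumber, isRepeat) are illustrative assumptions, not established conventions:

```python
# Stand-in for an external key-value store mapping userId -> visit count.
visit_counts = {}

def enrich(event):
    """Stamp each pageview event with per-user visit-count dimensions
    before it is ingested into Druid. Field names are assumptions."""
    uid = event["userId"]
    visit_counts[uid] = visit_counts.get(uid, 0) + 1
    event["visitNumber"] = visit_counts[uid]
    # "Repeat" per the definition in [1]: two or more visits.
    event["isRepeat"] = visit_counts[uid] >= 2
    return event
```

At query time you could then filter on isRepeat="true" (or visitNumber) like any other dimension. Note this counts visits since the store was started, not per reporting period, which is exactly the wrinkle Xavier points out.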

Thanks for the reply Xavier! That all makes sense.

The best I could come up with for new vs. returning visitors is to enrich the pageview event outside Druid (e.g. in a Samza job) with a firstVisit boolean field that is true only if this is the first pageview event ever for that userId. Then a timeseries query with a filter on firstVisit=true and a hyperUnique aggregation on userId should give the number of new visitors, and a timeseries query with just the hyperUnique aggregation on userId (no filter) should give the number of unique visitors. The number of returning visitors is then just unique visitors minus new visitors. I think that has to be done as two separate queries; I can’t find a way to get both unique visitors and new visitors in the same query, since they use different filters.
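The enrichment half of this could be sketched roughly like the following, with a plain set standing in for the Samza job's persistent state, plus the filtered query body it enables. The datasource, interval, metric name ("userIdSketch"), and the "true" string encoding of the boolean are all assumptions:

```python
# Stand-in for the enrichment job's persistent "seen before" state.
seen_users = set()

def tag_first_visit(event):
    """Set firstVisit=True only on the first-ever event for this userId."""
    uid = event["userId"]
    event["firstVisit"] = uid not in seen_users
    seen_users.add(uid)
    return event

# New visitors: the same hyperUnique aggregation, plus a selector filter
# on the firstVisit dimension (assumed to be ingested as a string).
new_visitor_query = {
    "queryType": "timeseries",
    "dataSource": "pageviews",
    "granularity": "day",
    "intervals": ["2015-01-01/2015-02-01"],
    "filter": {"type": "selector", "dimension": "firstVisit", "value": "true"},
    "aggregations": [
        {"type": "hyperUnique", "name": "newVisitors", "fieldName": "userIdSketch"}
    ],
}
```

Dropping the filter gives the unfiltered unique-visitor query, and returning visitors fall out as the difference between the two results.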

If there are better/easier ways to compute new and returning visitors I would love to hear them!

Thanks,

Zach

Hi Zach,

Hi - if we have really simple, standard website pageview events with these attributes:

  • url
  • userId
  • timestamp

Are there standard Druid best practices for ingest dimensions/aggregations and queries for calculating unique, new, return and repeat visitor counts, as defined in [1]?

For unique visitors, there seem to be two straightforward options:

  1. include userId as a dimension during indexing and then do a timeseries query with a cardinality aggregation on userId

Solution number 1 is definitely valid; in this case Druid will apply HyperLogLog at query time to estimate the cardinality of the dimension.

  2. include a hyperUnique aggregation on userId during indexing and then do a timeseries query with a hyperUnique aggregation on userId

Solution number 2 is also valid, but you have to ingest the data as a “hyperUnique” metric at indexing time.
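For illustration, the metrics spec for solution 2 might look roughly like this fragment (written as a Python list of dicts in the JSON shape of an ingestion spec; the metric name "userIdSketch" is an assumption):

```python
# Fragment of an ingestion spec's metricsSpec: alongside the usual row
# count, fold userId into a hyperUnique (HLL) metric at indexing time.
# The metric name "userIdSketch" is an illustrative assumption.
metrics_spec = [
    {"type": "count", "name": "count"},
    {"type": "hyperUnique", "name": "userIdSketch", "fieldName": "userId"},
]
```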

I assume you have read this page: http://druid.io/docs/0.7.0/Aggregations.html. If not, it is a good place to start.

More details about HyperLogLog: http://druid.io/blog/2014/02/18/hyperloglog-optimizations-for-real-world-systems.html and http://druid.io/blog/2012/05/04/fast-cheap-and-98-right-cardinality-estimation-for-big-data.html

Thanks for the details Slim!

It seems like solution #2 would be much more efficient than solution #1. At indexing time for solution #1, Druid would create a separate row for every unique userId, but in solution #2 Druid would aggregate all of the userIds into a single HLL in one row. So solution #1 creates a lot more rows (i.e. larger segments) than solution #2.

Then at query time, for solution #1 Druid has to combine all of the separate rows into a userId HLL, while in solution #2 this HLL already exists in a single row that was created during indexing. So solution #1 takes more time to query than solution #2.

Does that sound accurate?

It seems like solution #2 would be much more efficient than solution #1. At indexing time for solution #1, Druid would create a separate row for every unique userId, but in solution #2 Druid would aggregate all of the userIds into a single HLL in one row. So solution #1 creates a lot more rows (i.e. larger segments) than solution #2.

Correct: with HLL you benefit from rollup, so your segments will typically be smaller when a lot of userIds combine into a single row.
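The rollup effect can be shown with a toy example, using a plain set as a stand-in for the HLL sketch; the three events and their timestamps are invented for illustration:

```python
from collections import defaultdict

# Three invented pageview events in the same hour for the same url.
events = [
    {"hour": "2015-01-01T00", "url": "/a", "userId": "u1"},
    {"hour": "2015-01-01T00", "url": "/a", "userId": "u2"},
    {"hour": "2015-01-01T00", "url": "/a", "userId": "u3"},
]

# Solution 1: userId is a dimension, so every distinct
# (hour, url, userId) combination stays a separate row.
rows_with_userid_dim = {(e["hour"], e["url"], e["userId"]) for e in events}

# Solution 2: userId is folded into a sketch (a set stands in for HLL),
# so all three events roll up into a single (hour, url) row.
sketch_rows = defaultdict(set)
for e in events:
    sketch_rows[(e["hour"], e["url"])].add(e["userId"])
```

Three rows versus one row here; on a high-traffic site the same collapse happens across millions of userIds per time bucket.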

Then at query time, for solution #1 Druid has to combine all of the separate rows into a userId HLL, while in solution #2 this HLL already exists in a single row that was created during indexing. So solution #1 takes more time to query than solution #2.

Yep, with the HLL already in place it will be faster to query, since part of the work has already been done at indexing time.