Retention analysis problem

Hi all,
I'm using the Druid 0.9.0 stable version. Does 0.9.0 support retention analysis? If so, how?

I saw the topic below, which says thetaSketch can be used for retention analysis: !searchin/druid-user/retention|sort:date/druid-user/cWGsgJKgg8o/-FJzlYctEAAJ

The thetaSketch post-aggregation example there shows the unique users who visited both product A and product B within a fixed interval.

But as I understand it, retention analysis should define a cohort in one interval, such as [2016-06-01/2016-06-03], and then query results in other intervals, such as the following days [2016-06-03/2016-06-05].

So can I set two or more intervals in a thetaSketch aggregation?

Or am I wrong?

Any reply is appreciated!

Hey Skyler,

I think this is not currently possible out of the box. Applying the technique from the docs to retention would require filtered aggregators to be able to filter on time, which is not possible right now (although it will likely be possible in the future). There might be an alternative technique I'm missing; maybe someone else can chime in with thoughts there.

Hi @Skyler Tao,

We are an e-commerce company and need to do retention analysis. We currently do this with an offline processor: a Node.js Kafka consumer that listens to traffic and keeps track of users. When a new user ID reaches us on a given day, we insert one more event with that user's first purchase date and last purchase date. There may be better ways to do this, but I can't think of any right now. I hope Druid will support aggregation based on time intervals.

If you want month-on-month cohort analysis, you can set the granularity to month, which gives the unique user count for each month.

  1. Get all the users who visited/purchased in Jan 2017.
  2. Get all the users who visited/purchased in Feb, filtered by the user IDs obtained in Jan (M0).
  3. Get all the users who visited/purchased in Mar, filtered by the user IDs obtained in Jan (M1).
  4. Get all the users who visited/purchased in Apr, filtered by the user IDs obtained in Jan (M2).

This is the one way I can think of. Please share your views.
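The four steps above can be sketched offline with exact sets. This is only an illustration of the logic, not a Druid query; the `events` rows and the `monthly_retention` helper are made up for the example.

```python
from collections import defaultdict

# Hypothetical (user_id, month) activity rows; in practice these would come
# from your own event store.
events = [
    ("u1", "2017-01"), ("u2", "2017-01"), ("u3", "2017-01"),
    ("u1", "2017-02"), ("u2", "2017-02"), ("u4", "2017-02"),
    ("u1", "2017-03"),
    ("u3", "2017-04"), ("u4", "2017-04"),
]

def monthly_retention(events, cohort_month, later_months):
    """Return {month: number of cohort users who were active in that month}."""
    users_by_month = defaultdict(set)
    for user_id, month in events:
        users_by_month[month].add(user_id)
    # Step 1: the cohort is everyone active in the cohort month.
    cohort = users_by_month[cohort_month]
    # Steps 2-4: intersect each later month's users with the cohort.
    return {m: len(cohort & users_by_month[m]) for m in later_months}

retention = monthly_retention(events, "2017-01", ["2017-02", "2017-03", "2017-04"])
print(retention)
```

With exact sets the cohort's user IDs have to be carried from one step to the next, which is the part that gets expensive at scale.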

Hi Gian,

Hope you’re well. My product development team is considering using Druid for our analytics pipeline and an important use case is behavioural cohort analysis. What I mean is we would like to examine how a user’s behaviour impacts their future usage of the platform. For this we’d like to create cohorts based on behavioural events and then be able to use those cohorts to analyse usage.

Here’s an example:

We have a three-step onboarding process. We'd like to know if completing this onboarding early impacts future usage and retention. To understand this:

  1. We define a cohort of people who have completed the onboarding within 7 days of using the product.

  2. This cohort becomes a slice of our data

  3. We examine retention and usage metrics for the users who are in this cohort.

  4. We compare and contrast retention for the users who are not in this cohort.

Significant questions which I’m trying to answer are:

  • Would this kind of analysis be possible with Druid?

  • How/where would we define the cohort – does this need to happen before the data is loaded? Pre-computation at the data level does limit exploratory analysis.

  • If the cohort can be defined dynamically, where does it get saved? Do we require side storage for this kind of information?

Thanks a lot for your attention!



Following up here: retention analysis using time filtered theta sketches is possible out of the box now. We have an example on
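For readers landing here, a minimal sketch of what such a query might look like, built as a Python dict and printed as JSON. It assumes a Druid version that supports the interval filter on `__time` inside filtered aggregators (per the earlier replies, 0.9.0 did not), and it uses the intervals from the original question. The datasource name `events` and the sketch column `user_id_sketch` are hypothetical, and the helper assumes the cohort interval starts before the return interval ends.

```python
import json

def time_filtered_sketch(name, interval, field_name="user_id_sketch"):
    """Filtered thetaSketch aggregator restricted to one time interval.

    field_name is a hypothetical theta sketch column.
    """
    return {
        "type": "filtered",
        "filter": {"type": "interval", "dimension": "__time", "intervals": [interval]},
        "aggregator": {"type": "thetaSketch", "name": name, "fieldName": field_name},
    }

def retention_query(datasource, cohort_interval, return_interval):
    # The query interval must cover both windows; this assumes the cohort
    # window comes first.
    outer = f"{cohort_interval.split('/')[0]}/{return_interval.split('/')[1]}"
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "all",
        "intervals": [outer],
        "aggregations": [
            time_filtered_sketch("cohort_users", cohort_interval),
            time_filtered_sketch("returning_users", return_interval),
        ],
        "postAggregations": [{
            "type": "thetaSketchEstimate",
            "name": "retained_users",
            "field": {
                "type": "thetaSketchSetOp",
                "name": "cohort_and_returning",
                "func": "INTERSECT",
                "fields": [
                    {"type": "fieldAccess", "fieldName": "cohort_users"},
                    {"type": "fieldAccess", "fieldName": "returning_users"},
                ],
            },
        }],
    }

query = retention_query("events", "2016-06-01/2016-06-03", "2016-06-03/2016-06-05")
print(json.dumps(query, indent=2))
```

The printed JSON is what you would send to the broker as a native query; `retained_users` is then an approximate count of users seen in both windows.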

Hey Seb,

The main two tools in Druid for doing this kind of thing are pre-computation and theta sketches. With pre-computation you would add a column that denotes whether a user is in a particular cohort or not and then filter/group on that column to compare cohort vs. control. Of course this has to be done in advance, and like you say, leaves you unable to define cohorts dynamically.
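As a rough illustration of the pre-computation route, here is a sketch of tagging rows with a cohort flag before they are loaded. The event shapes, the `onboarded_early` column name and the `tag_cohort` helper are all assumptions made for the example, using the "onboarded within 7 days" cohort from the question.

```python
from datetime import date

# Hypothetical raw events: (user_id, event_type, event_date).
raw_events = [
    ("u1", "first_use", date(2017, 6, 1)),
    ("u1", "onboarding_complete", date(2017, 6, 4)),
    ("u2", "first_use", date(2017, 6, 1)),
    ("u2", "onboarding_complete", date(2017, 6, 20)),
    ("u1", "purchase", date(2017, 7, 2)),
    ("u2", "purchase", date(2017, 7, 3)),
]

def tag_cohort(events, max_days=7):
    """Add an 'onboarded_early' column to every row before ingestion."""
    first_use, onboarded = {}, {}
    for user, kind, day in events:
        if kind == "first_use":
            first_use[user] = min(day, first_use.get(user, day))
        elif kind == "onboarding_complete":
            onboarded[user] = min(day, onboarded.get(user, day))

    def in_cohort(user):
        # Cohort member: onboarding finished within max_days of first use.
        return (user in first_use and user in onboarded
                and 0 <= (onboarded[user] - first_use[user]).days <= max_days)

    return [
        {"user_id": u, "event": k, "date": d.isoformat(),
         "onboarded_early": "true" if in_cohort(u) else "false"}
        for u, k, d in events
    ]

rows = tag_cohort(raw_events)
for row in rows:
    print(row)
```

Once loaded, the flag behaves like any other dimension to filter or group on. Changing the 7-day rule means recomputing the column and reloading, which is the limitation described above.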

With the theta sketch approach, you'd use their ability to do set operations and compute measures like: intersect(signed-up-in-week-X, onboarded-in-week-X, used-product-on-week-X+2) and so on. It's approximate but gives you more flexibility to define cohorts dynamically, as long as you can express them as set operations like union, intersect, and difference. In this approach, the cohort is computed dynamically on every query and doesn't get saved anywhere.
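To make the set-operation idea concrete, here is a sketch of composing nested `thetaSketchSetOp` post-aggregators for a cohort and its control group. The aggregator names (`signed_up_week_x`, `onboarded_week_x`, `active_week_x_plus_2`) are hypothetical and would need to exist as thetaSketch aggregators in the same query; the small helper functions are just for the example.

```python
import json

def field(name):
    """Reference a thetaSketch aggregator defined elsewhere in the query."""
    return {"type": "fieldAccess", "fieldName": name}

def set_op(func, name, *fields):
    """Build a thetaSketchSetOp post-aggregator (UNION, INTERSECT or NOT)."""
    if func not in {"UNION", "INTERSECT", "NOT"}:
        raise ValueError(f"unsupported set operation: {func}")
    return {"type": "thetaSketchSetOp", "name": name, "func": func, "fields": list(fields)}

def estimate(name, sketch):
    """Turn a sketch-valued post-aggregator into a numeric estimate."""
    return {"type": "thetaSketchEstimate", "name": name, "field": sketch}

# Cohort: signed up AND onboarded in week X.
cohort = set_op("INTERSECT", "cohort",
                field("signed_up_week_x"), field("onboarded_week_x"))
# Control: signed up in week X but did NOT onboard (set difference).
control = set_op("NOT", "control",
                 field("signed_up_week_x"), field("onboarded_week_x"))
active = field("active_week_x_plus_2")

post_aggregations = [
    estimate("cohort_retained", set_op("INTERSECT", "cohort_active", cohort, active)),
    estimate("control_retained", set_op("INTERSECT", "control_active", control, active)),
]
print(json.dumps(post_aggregations, indent=2))
```

Because the cohort is just an expression inside the query, nothing is stored on the Druid side; rerunning the query recomputes it.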

Hi Gian,

Thanks for the detailed explanation; that solves our calendar-period cohorts. But…

Is it possible to find rolling-period cohorts dynamically, using either the JavaScript aggregator or Theta Sketch aggregator operations?

A rolling period works like this: if user X first came on June 1st, 2017, then June 1st to June 7th is his w0 period; if user Y signed up on June 6th, then June 6th to June 13th is his w0 period.

How do we find whether they made a purchase (or had any activity) in w1, which is June 8th - June 15th for user X and June 14th - 21st for user Y?

Is it possible to do this kind of operation/aggregation?
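To pin down what is being asked, here is an offline sketch of the rolling-period logic with exact data. It is not a Druid query and does not answer whether Druid can do this natively. It assumes fixed 7-day windows counted from each user's sign-up date, which differs slightly from the date ranges written above; the sample data and helper names are made up.

```python
from datetime import date

# Hypothetical sign-up dates and activity rows.
signups = {"X": date(2017, 6, 1), "Y": date(2017, 6, 6)}
activity = [
    ("X", date(2017, 6, 3)),   # 2 days after sign-up
    ("X", date(2017, 6, 10)),  # 9 days after sign-up
    ("Y", date(2017, 6, 9)),   # 3 days after sign-up
    ("Y", date(2017, 6, 25)),  # 19 days after sign-up
]

def week_index(signup, day):
    """0 for w0, 1 for w1, ... relative to this user's own sign-up date."""
    return (day - signup).days // 7

def users_active_in_week(signups, activity, week):
    """Users with at least one event in their own rolling week number `week`."""
    return {
        user for user, day in activity
        if user in signups
        and day >= signups[user]
        and week_index(signups[user], day) == week
    }

w0 = users_active_in_week(signups, activity, 0)
w1 = users_active_in_week(signups, activity, 1)
print(sorted(w0), sorted(w1))
```

The point of the sketch is that every user has a different window, so a single pair of fixed query intervals cannot express it. If a value like `week_index` were written as a column at load time it could in principle be filtered on like any other dimension, but that is a pre-computation, and this thread does not settle whether it fits your pipeline.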

Hi Gian, thanks a lot for your prompt reply. I think the theta sketch option based on set operations meshes well with how we think of cohorts. It's great to know Druid supports this out of the box!

I'd like to get a better handle on the accuracy of the theta sketch approach. I read through the datasketches doc, which helped me understand that if the size of the sketch is 16384, the relative error would be about 0.007 (roughly 0.7%). Over a column with cardinality 10MM that would be approximately 70K. Do you know how well this estimate bears out in practice?
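The arithmetic behind those figures can be checked with the commonly quoted approximation that a theta sketch's relative standard error is about 1/sqrt(k) for nominal size k. Treat this as a back-of-the-envelope sketch: it is one standard deviation for a single sketch, and the `theta_rse` helper is just a name for the example.

```python
import math

def theta_rse(k):
    """Approximate relative standard error (one standard deviation) for size k."""
    return 1.0 / math.sqrt(k)

k = 16384
cardinality = 10_000_000

rse = theta_rse(k)
absolute_error = rse * cardinality

print(f"relative standard error: {rse:.4f}")
print(f"one-sigma error on {cardinality:,} uniques: {absolute_error:,.0f}")
```

For k = 16384 this gives 1/128, a little under 0.8%, or about 78K on 10MM uniques. Estimates that come out of set operations, intersections in particular, can have a noticeably larger error relative to the size of the result, so cohort intersections that are small compared with their inputs deserve extra caution.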

Second, if I understand correctly, to persist the definition of a cohort (not its results) so that it's simple to use with other queries, we'd need to build some tools ourselves, possibly some kind of query metadata store. I hope I'm correctly following your note about the cohort not being saved anywhere.

Thanks a lot for your assistance.