Is path analysis (funnel analysis) possible with Druid?

I’m currently working on a web analytics framework, and I’m certain Druid is the perfect choice for most of the metrics I want to calculate, but what about path or funnel analysis?

From what I’ve been reading, this is the usual approach to calculating paths:

https://segment.com/blog/building-ultimate-funnel-sql/

Would this be possible with Druid? If not, is it possible with another part of my stack, such as Apache Storm or Cassandra?

Sorry if this is a silly question; databases are new territory for me.

Let me know if you need further info, as I may not have explained my problem well.

Thanks

Lewis

Yes, several folks do funnel analysis with Druid in production. We are adding features to make funnel analysis better supported in Druid. Look for announcements in the near future.

Brilliant, although I’m still unsure how I would go about doing it. Could someone give me an example or point me in the right direction? Thanks

Hey Lewis,

That suggested funnel schema (one row per session, with dimensions for did_step_one, did_step_two, etc. and metrics for count_step_one, count_step_two, etc.) would work great in Druid. So would the path schema (one row per session, with dimensions for event1, event2, etc.). A third option that would also work is a path schema where, instead of breaking out event1, event2, etc., you concatenate them into a single string (like “event1,event2”) and store them in a single dimension called “path”. One nice thing you can do there is a topN query on “path” to get the N most popular paths for some other filter set (all users, or users on a particular device, perhaps).
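For illustration, a rough sketch of what that topN on “path” could look like (the “sessions” datasource, “device” dimension, and “count” metric are made-up names for this example):

    {
      "queryType": "topN",
      "dataSource": "sessions",
      "dimension": "path",
      "metric": "count",
      "threshold": 10,
      "granularity": "all",
      "intervals": ["2016-01-01/2016-02-01"],
      "filter": { "type": "selector", "dimension": "device", "value": "mobile" },
      "aggregations": [
        { "type": "longSum", "name": "count", "fieldName": "count" }
      ]
    }

This would return the 10 most common paths among mobile sessions in January 2016.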

The trick in all three cases is materializing this data within Druid. Druid will be able to rapidly query these schemas, but it is not the best tool for generating them. For that, you can use a pre-Druid step that groups your events by session id and outputs a single row per session, with the appropriate derived dimensions, to Druid. In batch you could do this with Hive, Pig, Spark, or any other batch processing framework. With streams you could do this with a stream processor like Storm, Samza, or Spark Streaming.
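To make that concrete, here is a minimal PySpark sketch of the batch version (the input path, event schema, and column names are assumptions for illustration):

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("sessionize").getOrCreate()

    # One row per raw event: session_id, event name, timestamp.
    events = spark.read.json("s3://my-bucket/events/2016-01-01/")

    # Collapse each session into a single row: sort the session's events by
    # time, then concatenate the event names into a single "path" dimension.
    sessions = events.groupBy("session_id").agg(
        F.min("timestamp").alias("session_start"),
        F.concat_ws(
            ",",
            F.transform(
                F.sort_array(F.collect_list(F.struct("timestamp", "event"))),
                lambda s: s["event"],
            ),
        ).alias("path"),
    )

    # Derived funnel dimensions, e.g. whether the session ever reached step two.
    sessions = sessions.withColumn(
        "did_step_two", F.col("path").contains("view_page2").cast("string")
    )

    # Hand the one-row-per-session output to Druid ingestion.
    sessions.write.json("s3://my-bucket/druid-input/sessions/")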

Hi Fangjin, is this funnel-analysis-specific query supported in Druid now? As of now, it looks like we need to pre-process the data and ingest it into Druid. In that case, Druid just becomes the storage for count/max/min measures over these dimensions.

Is there any better alternative available?

This is a key requirement in the typical real-time event stream analytics space.

We are thinking of doing this analysis in Spark. In that case the question becomes: if we are pre-processing data in Spark/Flink (which provides excellent dynamic windowing capability), why can’t we store the output in any other NoSQL store? I know Druid will have an edge because of its cube capabilities for slicing and dicing across multiple dimension combinations.

Regards

-Sambit

Hey Sambit,

Most people doing funnel analysis in Druid are either pre-processing to some degree (generally grouping events into user sessions and then including dimensions for interesting properties of the session) or using theta sketches (http://druid.io/docs/latest/development/extensions-core/datasketches-aggregators.html). Theta sketches can answer questions like “how many users have done X and Y but not Z” even on a datasource that is raw events, not grouped into sessions.
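As a hedged sketch, an “X and Y but not Z” question could look like the query below, using filtered thetaSketch aggregators at query time (the “events” datasource and the “event”/“user_id” columns are made-up names; thetaSketchSetOp’s NOT subtracts the later fields from the first):

    {
      "queryType": "timeseries",
      "dataSource": "events",
      "granularity": "all",
      "intervals": ["2016-01-01/2016-02-01"],
      "aggregations": [
        {
          "type": "filtered",
          "filter": { "type": "selector", "dimension": "event", "value": "X" },
          "aggregator": { "type": "thetaSketch", "name": "users_x", "fieldName": "user_id" }
        },
        {
          "type": "filtered",
          "filter": { "type": "selector", "dimension": "event", "value": "Y" },
          "aggregator": { "type": "thetaSketch", "name": "users_y", "fieldName": "user_id" }
        },
        {
          "type": "filtered",
          "filter": { "type": "selector", "dimension": "event", "value": "Z" },
          "aggregator": { "type": "thetaSketch", "name": "users_z", "fieldName": "user_id" }
        }
      ],
      "postAggregations": [
        {
          "type": "thetaSketchEstimate",
          "name": "users_x_and_y_not_z",
          "field": {
            "type": "thetaSketchSetOp",
            "name": "x_and_y_not_z",
            "func": "NOT",
            "fields": [
              {
                "type": "thetaSketchSetOp",
                "name": "x_and_y",
                "func": "INTERSECT",
                "fields": [
                  { "type": "fieldAccess", "fieldName": "users_x" },
                  { "type": "fieldAccess", "fieldName": "users_y" }
                ]
              },
              { "type": "fieldAccess", "fieldName": "users_z" }
            ]
          }
        }
      ]
    }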

Hi there,
I would like to bump this topic.

I’m looking for the most performant solution for interactive pattern matching over series of events by some actors. A very common use case nowadays, mentioned in this topic, is behavioral analytics, like path/funnel analysis (e.g. “find users that viewed product X, then added it to the cart, then purchased it”). After (outstandingly) successful use of Druid in a few projects for OLAP-like analytics, I’m brainstorming whether Druid could be applied to my case.

Let me raise my concerns:

  • Correct me if I’m wrong, but apparently there is currently no functionality that allows looking for patterns across multiple events. Fangjin wrote above that:

We are adding features to make funnel analysis better supported in Druid. Look for announcements in the near future.

but I don’t think that has happened yet, has it?

I’ll be grateful for any comments on this matter.

Cheers,

Krzysztof

On Thursday, August 4, 2016 at 18:19, shaikidris.ali@olacabs.com wrote:

Hey Krzysztof,

Druid still requires that segments are partitioned on time (according to segmentGranularity) and that within each segment, __time must be the first field in the sort order. But you could work around that by doing the following:

  1. at indexing time, set queryGranularity (truncation for __time) equal to segmentGranularity, which will make all __time values in a segment the same

  2. at indexing time, configure Druid to partition your data by actor_id, which will cause all rows for the same actor_id within a segmentGranularity time bucket to end up in the same segment

  3. optionally, ingest another dimension like “actual_time” containing a non-truncated __time

  4. set the dimension sort order to (actor_id, actual_time)

Now, all of your rows within a segment will be sorted by (actor_id, actual_time). This has a couple of downsides:

  1. timeseries queries won’t be able to use your actual_time field

  2. the actual_time dimension must be stored as a string (although in the future we will support int and long dimensions)

But it may still be good enough for some use cases. In particular, you could write a custom Aggregator, and that Aggregator will get the rows for each segment in order (and therefore will see the rows for each actor_id all at once, and can optimize for that). This is just within the bounds of one segment, though. If you need to analyze rows across different segmentGranularity buckets, then things could get more complicated.
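For concreteness, a minimal sketch of how steps 1, 2, and 4 might look in the relevant parts of a modern native batch ingestion spec (the dimension names are assumptions; the key point is that with __time truncated to a constant per segment, the dimension order in dimensionsSpec becomes the effective sort order):

    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "day",
      "queryGranularity": "day",
      "intervals": ["2016-09-01/2016-09-02"]
    },
    "dimensionsSpec": {
      "dimensions": ["actor_id", "actual_time", "event"]
    },
    "tuningConfig": {
      "type": "index_parallel",
      "partitionsSpec": {
        "type": "single_dim",
        "partitionDimension": "actor_id"
      },
      "forceGuaranteedRollup": true
    }

Note that “actual_time” is listed as a plain (string) dimension here, which is exactly downside 2 above.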

Does that sound potentially useful? I’m actually really interested in what we can do to help people do more kinds of behavior analysis with Druid, so would love to hear what you have to say!

Hi Gian,
That’s an interesting idea, thanks for the writeup! It definitely could work for a lot of cases. But I believe the performance of that solution would be inferior to a case-optimized system, which I would like to build. One of the reasons is that the actual_time column would be very high cardinality (a different value for each row) with a huge dictionary, so its compression would be terrible. The right encoding of that column would use advanced integer compression ideas, significantly lowering its size. Of course, the ongoing work to support int/long dimensions in Druid might include optimizing columns like that.

So I assume that Druid in its current shape, even with workarounds like yours applied, would not be optimal for my case. Let me know if you disagree.

But let’s continue brainstorming:

I see and appreciate Druid from two different angles:

  1. As a very well designed distributed system with no-compromise fault tolerance and scalability, real-time ingestion, tiered historical storage, caching, and much more.

  2. As a database optimized for OLAP-like processing and extremely good at it.

Now, what I would like to get is a system that fulfills 1), but instead of 2) is optimized for behavior analysis / pattern matching. That would probably mean redesigning or extending the segment format, so that:

a) events are ordered first by actor_id, then by time. The series of each actor is analyzed independently (well, in most cases; let’s stick to that).

b) data is very well compressed, leveraging the properties of the case, especially for the time column (per-actor monotonically increasing integers); a toy sketch of this follows the list. Lowering the size is important for reading data faster and for keeping more data in memory and/or in cache.

c) it is column-oriented. I bet columnar storage would outperform row-oriented storage for the case of wide events.
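To illustrate point b): with per-actor sorting, the time column becomes monotonically increasing runs, which delta-encode into tiny integers. A toy Python sketch (illustrative only, not Druid code):

    def delta_encode(timestamps):
        """Turn sorted per-actor timestamps into a first value plus small gaps;
        the gaps compress far better than raw epoch values."""
        deltas = [timestamps[0]]
        for prev, cur in zip(timestamps, timestamps[1:]):
            deltas.append(cur - prev)
        return deltas

    def delta_decode(deltas):
        out, total = [], 0
        for d in deltas:
            total += d
            out.append(total)
        return out

    # Raw epoch seconds need ~32 bits each; the deltas here would fit in a
    # few bits with varint or bit-packing.
    ts = [1473976320, 1473976335, 1473976340, 1473976401]
    assert delta_encode(ts) == [1473976320, 15, 5, 61]
    assert delta_decode(delta_encode(ts)) == ts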

I understand that Druid is not there yet on all of these points at once. Now I see two options:

  1. Implement my own database, reimplementing the features of Druid that I appreciate in angle 1 above.

  2. Gut Druid’s storage layer and implement a specialized version for my type of analysis. I see that there is already work planned to generalize Druid’s storage layer and support this option [1](https://github.com/druid-io/druid/issues/2965).

What do you think? Is it feasible to extend Druid to match the path analysis case? Do you think it’s a lot of work? Has anyone done such a thing in the past (maybe for another use case)?

Thanks a lot for any advice,

Krzysztof

P.S. From the usability POV, to better understand what type of workload/functionality I would like to get, I feel it’s good to compare it to existing systems. Some of them are: the “Pattern Matching” feature of Oracle DB [2](https://docs.oracle.com/database/121/DWHSG/pattern.htm#DWHSG8963), and Teradata Aster nPath [3](https://developer.teradata.com/aster/articles/aster-npath-guide).

P.P.S. I intentionally haven’t mentioned anything about a language for querying behavioral data and its absence in Druid, as this problem is secondary to the right storage design and can be solved later.

On Friday, September 16, 2016 at 00:12, Gian Merlino gian@imply.io wrote:

Hi Fangjin,

As you mentioned four years ago:

“We are adding features to make funnel analysis better supported in Druid. Look for announcements in the near future.”

Time flies and four years have passed. Now we want to know “how many users have done X followed by Y” rather than simply “X and Y” in funnel analysis. I am not sure whether we can achieve this with theta sketches or another data sketch type, or whether we must pre-process the data in Spark/Flink.

Can you give us some advice?

Looking forward to your reply.

On Monday, October 19, 2015 at 1:03:27 AM UTC+8, Fangjin Yang wrote:

The current set of capabilities is described here: https://imply.io/post/clickstream-funnel-analysis-with-apache-druid

Hi Fangjin,
Thanks for your help.

We have read the content on this page: https://imply.io/post/clickstream-funnel-analysis-with-apache-druid

It may not help us, though.

As I have mentioned:

“we want to know ‘how many users have done X followed by Y’ rather than simply ‘X and Y’ in funnel analysis.”

In our case, there is an ordered funnel: view page1 -> view page2 -> view page3. In this funnel, visitors should do these things in order, and they should complete all of these operations within one day.

The configuration below can probably only support querying the distinct users who “viewed the landing page” and “clicked download resource”, rather than those who “viewed the landing page” first and then “clicked download resource”. That is to say, the configuration below would also count the visitors who “clicked download resource” first and then “viewed the landing page”, which is not what we expect.

Thanks for your time.

Looking forward to your reply.

"postAggregations": [
  {
    "type": "thetaSketchEstimate",
    "name": "users_email_provided_count",
    "field": {
      "type": "thetaSketchSetOp",
      "name": "users_email_provided_count",
      "func": "INTERSECT",
      "fields": [
        { "type": "fieldAccess", "fieldName": "users_viewed_landing_page_count" },
        { "type": "fieldAccess", "fieldName": "users_clicked_download_resource_count" }
      ]
    }
  }
]

On Friday, November 29, 2019 at 4:31 AM, Fangjin Yang fangjin@imply.io wrote:

Hi Fangjin,
Thanks for your help.

We have read the content on this page: https://imply.io/post/clickstream-funnel-analysis-with-apache-druid

It may not help us, though.

As I have mentioned:

“we want to know ‘how many users have done X followed by Y’ rather than simply ‘X and Y’ in funnel analysis.”

The configuration below can probably only support querying the distinct users who “viewed the landing page” and “clicked download resource”, rather than those who “viewed the landing page” first and then “clicked download resource”. That is to say, the configuration below would also count the visitors who “clicked download resource” first and then “viewed the landing page”, which is contrary to the operation sequence of the funnel we defined.

In our case, there is an ordered funnel: view page1 -> view page2 -> view page3. Visitors who view page1 first and then view page2 match the funnel. That is to say, visitors who view page2 first and then view page1 should not be counted in the funnel analysis.

Can we achieve ordered funnel analysis with theta sketches or another data sketch type, or must we pre-process the data in Spark/Flink?

I may not have explained clearly the problem we encountered using Druid to implement an ordered funnel. If you need more information, please let me know.

Thanks for your time.

Looking forward to your reply.

P.S. Part of the configuration shown on the page https://imply.io/post/clickstream-funnel-analysis-with-apache-druid:

"postAggregations": [
  {
    "type": "thetaSketchEstimate",
    "name": "users_email_provided_count",
    "field": {
      "type": "thetaSketchSetOp",
      "name": "users_email_provided_count",
      "func": "INTERSECT",
      "fields": [
        { "type": "fieldAccess", "fieldName": "users_viewed_landing_page_count" },
        { "type": "fieldAccess", "fieldName": "users_clicked_download_resource_count" }
      ]
    }
  }
]

On Friday, November 29, 2019 at 4:31:22 AM UTC+8, Fangjin Yang wrote:

Is there any update on this? We are also evaluating Druid for some work, and an ordered funnel is a must.

Hi Tarun,

Please see if this helps. https://imply.io/post/clickstream-funnel-analysis-with-apache-druid

Thanks

Hey Tarun,
I hear your pain.

We (at Nielsen) actually had similar needs, so we took the advice mentioned earlier in this thread, and combined Spark and Druid to solve the use case of funnel analysis.

Moreover, I gave a talk about that at our meetup with Imply last week; see the slides and video.

Let me know if there are any follow-up questions.

Thanks,

Itai

Hello Tijo,

Thanks, but this has the same issue of not honouring the order of events.

Hi Tarun,

Yes, this works if the events follow the order.

Otherwise, you may need to apply the logic for handling out-of-order events outside Druid, e.g. in Spark Streaming or Kafka Streams.

Thanks

I am not sure how that works. How do sketches ensure ordering, even if the segments are ordered?

Tarun,
I’ll try to elaborate, because I think your use-case is similar to ours.

To quote your question:

"In our case,there is a ordered funnel:view page1—view page2----view page3. And the visitors who exactly view page1 first and then view page2 are match with the funnel.That is to say ,the visitors who view page2 first and then view page1 should not be calculated in the funnel analysis.

Can we achieve the ordered funnel analysis with theta sketches or other data sketch type or we must pre-processing data in spark/flink"

So the answer is, to the best of my knowledge: you can’t achieve ordered funnel analysis with theta sketches alone.

If you want to ignore cases where visitors view page2 first and then view page1 (i.e., not count them in the funnel analysis), you need to pre-process the data in Spark/Flink.
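As a rough illustration of that pre-processing step, here is a PySpark sketch that derives ordered-funnel flags per visitor (the column names, input path, and the “first occurrence” definition of order are my assumptions, not the exact code from the talk):

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("ordered-funnel").getOrCreate()

    # One row per raw event: visitor_id, event name, timestamp.
    events = spark.read.json("s3://my-bucket/pageviews/")

    # Earliest time each visitor performed each funnel step.
    firsts = events.groupBy("visitor_id").agg(
        F.min(F.when(F.col("event") == "view_page1", F.col("timestamp"))).alias("t1"),
        F.min(F.when(F.col("event") == "view_page2", F.col("timestamp"))).alias("t2"),
        F.min(F.when(F.col("event") == "view_page3", F.col("timestamp"))).alias("t3"),
    )

    # A visitor counts for step N only if steps 1..N happened in order.
    funnel = firsts.select(
        "visitor_id",
        F.col("t1").isNotNull().alias("did_step1"),
        F.coalesce(F.col("t2") > F.col("t1"), F.lit(False)).alias("did_step2_in_order"),
        F.coalesce(
            (F.col("t2") > F.col("t1")) & (F.col("t3") > F.col("t2")), F.lit(False)
        ).alias("did_step3_in_order"),
    )

    # Ingest one row per visitor into Druid; counts or sketches over these
    # boolean dimensions now respect the event order.
    funnel.write.json("s3://my-bucket/druid-input/funnel/")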

That’s why I mentioned the meetup we had with Imply a few weeks ago, where I talked about this use case specifically; see the slides and video.

I hope this will provide more clarity on the subject.

Thanks,

Itai