[druid-user] ClickStream Analytics

Hello,

I am currently using Apache Druid (druid.io) to develop a click-stream analytics web app, primarily for learning purposes. The basic setup is that I have a FastCGI application that takes care of data acquisition by writing the initial JSON data to the filesystem of my VPS. Then, at regular intervals, the data is sent to Druid via a POST request. So nothing too fancy.

As I don’t have deep experience in this field, however, I was wondering if anyone knows about best practices for designing an ingestion spec specifically for click-stream analytics? I have gone over the tutorial for writing an ingestion spec, and it’s proven really useful. But I can’t seem to find resources more specifically tailored to which kinds of data clickstream analytics apps usually consider most valuable to track.

Now, I’m sure I could muddle through and come up with an ingestion spec on my own. Things like IP address, region of the user, which page was accessed, etc., make sense to track. And I actually am going to do that for right now. But I’m interested in resources that approach the theory of clickstream analytics and best practices from a more general, deep-dive perspective than what I can come up with off the top of my head, if that makes sense.

If anyone has any thoughts on this, I would be most appreciative.

Oh, and sorry about my username and email. I’ve actually had them for a long time, since before I even knew about Apache Druid, haha.

Thanks for your time,
DruidPeter

Hi Druid_Peter,

Welcome to the Apache Druid community.
I love your user ID, but my colleague @petermarshallio may have a different opinion :wink:

Here is a set of blog posts on the subject. I hope you find them useful:

Imply clickstream


Sergio

Personally I’m loving your username. :smiley: :smiley:

To get to the core of what you’re asking: no, I haven’t seen a published “standard ingestion spec” for clickstream out there per se. But do check out “Real time analytics: Divolte + Kafka + Druid + Superset” (GoDataDriven) and https://imply.io/blog/clickstream-analysis-open-source-divolte-kafka-druid/

Some other things to think about:

  • Enrich visitor data early, preferably upstream if the enrichment is complicated
  • Generate and enrich session data upstream as a second source, since it makes queries faster
  • Front-load common expressions, such as regular expressions, in the ingestion spec using transforms
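To make the last bullet a bit more concrete, here is a minimal sketch of a transformSpec that derives a top-level site section from the requested path at ingestion time, so queries don't have to run the regular expression themselves. The column names `path` and `section` and the pattern are assumptions made up for illustration, not anything from a standard spec. It is written in Python only so the pattern can be sanity-checked locally before it goes into the spec.

```python
import json
import re

# Assumed names: "path" is an input column, "section" is the derived one.
# The pattern captures the first path segment, e.g. "/blog/post-1" -> "blog".
SECTION_PATTERN = r"^/([^/?#]+)"


def build_transform_spec():
    """Return a Druid transformSpec with one expression transform."""
    # Druid expressions quote column names with double quotes and
    # string literals with single quotes.
    expression = f"regexp_extract(\"path\", '{SECTION_PATTERN}', 1)"
    return {
        "transforms": [
            {"type": "expression", "name": "section", "expression": expression}
        ]
    }


def preview_section(path):
    """Apply the same pattern locally to check what it would extract."""
    match = re.search(SECTION_PATTERN, path)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(json.dumps(build_transform_spec(), indent=2))
    print(preview_section("/blog/2021/clickstream?ref=home"))  # blog
```

The resulting dict would sit under `dataSchema` in the ingestion spec. As far as I know, the derived column also has to be listed in `dimensionsSpec` to actually be stored, so treat that as something to verify against the docs for your Druid version.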

The most common architecture I’ve seen, incidentally, is app → Kafka → Druid, rather than batch pushes. I’d highly recommend looking into that as you then get the real-time-ness and good stuff from stream processing.
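For the Druid end of that pipeline, a rough sketch of a Kafka supervisor spec could look like the following. The topic name, broker address, datasource name, and the `timestamp`, `ip`, `region`, and `path` columns are all placeholders based on the fields mentioned earlier in the thread, and the layout assumes a reasonably recent Druid release (older versions arrange the spec differently).

```python
import json


def build_kafka_supervisor_spec(topic, bootstrap_servers, datasource):
    """Build a minimal Kafka supervisor spec as a plain dict."""
    return {
        "type": "kafka",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                # Assumes each event carries an ISO-8601 "timestamp" field.
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                # Placeholder dimensions taken from the original question.
                "dimensionsSpec": {"dimensions": ["ip", "region", "path"]},
                "granularitySpec": {
                    "segmentGranularity": "HOUR",
                    "queryGranularity": "NONE",
                    "rollup": False,
                },
            },
            "ioConfig": {
                "type": "kafka",
                "topic": topic,
                "consumerProperties": {"bootstrap.servers": bootstrap_servers},
                "inputFormat": {"type": "json"},
                "useEarliestOffset": True,
            },
            "tuningConfig": {"type": "kafka"},
        },
    }


if __name__ == "__main__":
    spec = build_kafka_supervisor_spec(
        topic="clickstream",                # hypothetical topic
        bootstrap_servers="localhost:9092", # hypothetical broker
        datasource="clickstream",           # hypothetical datasource
    )
    print(json.dumps(spec, indent=2))
```

The printed JSON is what you would POST to the supervisor endpoint (`/druid/indexer/v1/supervisor`) instead of pushing batch files at intervals. Druid then keeps reading from the topic on its own.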

Some other things you may find over in Druid Forum:
https://www.druidforum.org/search?q=clickstream