I am currently using druid.io to develop a click-stream analytics web app, primarily for learning purposes. The basic setup is that I have a FastCGI application that takes care of data acquisition by writing the initial JSON data to the filesystem of my VPS. Then, at regular intervals, the data is sent to Druid via a POST request. So nothing too fancy.
As I don’t have deep experience in this field, though, I was wondering if anyone knows about best practices for designing an ingestion spec specifically for click-stream analytics. I have gone over the tutorial for writing an ingestion spec, and it’s proven really useful. But I can’t seem to find resources on which kinds of data clickstream analytics apps typically consider most valuable to track.
Now, I’m sure I could muddle through and come up with an ingestion spec on my own. Things like IP address, region of the user, which page was accessed, etc. make sense to track, and that’s what I’m going to do for now. But I’m interested in resources that take a deeper, more general dive into the theory and best practices of clickstream analytics than what I can come up with off the top of my head, if that makes sense.
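For what it’s worth, the dimension list I’m starting with looks roughly like this inside the dataSchema (all the field names here are my own, just illustrative):

```json
{
  "dataSchema": {
    "dataSource": "clickstream",
    "timestampSpec": { "column": "timestamp", "format": "iso" },
    "dimensionsSpec": {
      "dimensions": [
        "ip_address",
        "country",
        "region",
        "page_url",
        "referrer",
        "user_agent",
        "session_id"
      ]
    }
  }
}
```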
If anyone has any thoughts on this, I would be most appreciative.
Oh, and sorry about my username and email. I’ve actually had it for a long time, before I even knew about Apache Druid, haha.
Enrich visitor data early – preferably upstream if it’s complicated
Generate and enrich session data upstream as a second source – it makes queries faster
Front-load common expressions in the ingestion spec using transforms – like regular expressions
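For example, a transformSpec that pulls just the path out of a full URL at ingestion time might look something like this (the column names are placeholders, not anything standard):

```json
{
  "transformSpec": {
    "transforms": [
      {
        "type": "expression",
        "name": "page_path",
        "expression": "regexp_extract(\"page_url\", '^https?://[^/]+(/[^?#]*)', 1)"
      }
    ]
  }
}
```

That way queries can group by path directly instead of re-running the regex at query time.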
The most common architecture I’ve seen, incidentally, is app → Kafka → Druid, rather than batch pushes. I’d highly recommend looking into that, as you then get the real-time benefits and other good stuff that come with stream processing.
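As a rough sketch of what that route looks like, the ioConfig side of a Kafka supervisor spec is along these lines (the topic name and broker address are placeholders for your own setup):

```json
{
  "type": "kafka",
  "spec": {
    "ioConfig": {
      "type": "kafka",
      "topic": "clickstream-events",
      "consumerProperties": {
        "bootstrap.servers": "localhost:9092"
      },
      "inputFormat": { "type": "json" },
      "useEarliestOffset": true
    }
  }
}
```

Your FastCGI app would then publish each event to the Kafka topic instead of buffering JSON files on disk, and Druid ingests continuously.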