Analyzing real-time user activity

Hey, I’m trying to use Druid to get some metrics on our users’ activity on a video course website. The data is fed into Druid from Kafka, and its schema is as follows:

{"course_id": "course-v1:TsinghuaX+AP000005X+2016_T2", "user_id": "2864837", "session_id": "c3dcd5b8bd4ccbdd32c82eb22576a52e", "activity_event": "problem_save", "time": "2016-07-31T23:59:10"}

There are a total of 17 activity types a user can perform:

Activity types
  • click_about
  • click_courseware
  • click_forum
  • click_info
  • click_progress
  • close_courseware
  • load_video
  • pause_video
  • play_video
  • problem_check
  • problem_check_correct
  • problem_check_incorrect
  • problem_get
  • problem_save
  • reset_problem
  • seek_video
  • stop_video

My goal is to calculate some metrics based on this data:

  1. Number of active users
  2. Number of active users in each course
  3. Number of clicks on each part of the application in the last hour
  4. Number of clicks on each part of the application in the last hour of each course
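
For concreteness, I imagine metric 1 could be something like the following native timeseries query (the datasource name user_activity and the interval are just placeholders I made up):

```json
{
  "queryType": "timeseries",
  "dataSource": "user_activity",
  "granularity": "hour",
  "intervals": ["2016-07-31T23:00:00/2016-08-01T00:00:00"],
  "aggregations": [
    { "type": "cardinality", "name": "active_users", "fields": ["user_id"] }
  ]
}
```

and metric 2 would presumably be the same thing as a groupBy query with course_id as a dimension.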

I’d prefer not to run ad-hoc queries against the database, and instead use features like rollup and aggregations to solve the problem.

But my problem is how to define multiple rollups on the same dataset with different granularities (1s and 1h), and since I’m new to Druid, I’m not quite sure how to write the rollups in the first place.

Also, besides the metrics mentioned above, there are some more sophisticated ones, like the number of minutes users have spent watching courses in the last hour. This could be achieved via a SQL query; however, since we need these results in real time, I don’t think a plain query on the database is the right approach.
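
(To illustrate the kind of query I mean: one rough idea is to approximate watch time by counting distinct user-minute pairs that contain a video event. All names here are my own placeholders, and I’m not sure this approximation is accurate enough:

```json
{
  "queryType": "timeseries",
  "dataSource": "user_activity",
  "granularity": "all",
  "intervals": ["2016-07-31T23:00:00/2016-08-01T00:00:00"],
  "virtualColumns": [
    { "type": "expression", "name": "minute_bucket",
      "expression": "timestamp_floor(__time, 'PT1M')", "outputType": "LONG" }
  ],
  "filter": { "type": "in", "dimension": "activity_event",
              "values": ["play_video", "pause_video", "seek_video"] },
  "aggregations": [
    { "type": "cardinality", "name": "watch_minutes",
      "fields": ["user_id", "minute_bucket"], "byRow": true }
  ]
}
```

)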

As my final problem: how can I ensure fault tolerance? Say, for example, the Druid server fails and takes a couple of minutes to restart. What happens to the data that arrives in the meantime? How can I do recovery?

Relates to Apache Druid <0.23>

Welcome @Farzin_Nasiri!

Have you looked at creating multiple datasources? I’m thinking of the maximizing rollup ratio doc:

  • You can optionally load the same data into more than one Druid datasource. For example:
    • Create a “full” datasource that has rollup disabled, or enabled, but with a minimal rollup ratio.
    • Create a second “abbreviated” datasource with fewer dimensions and a higher rollup ratio. When queries only involve dimensions in the “abbreviated” set, use the second datasource to reduce query times. Often, this method only requires a small increase in storage footprint because abbreviated datasources tend to be substantially smaller.
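
To make that concrete, here is a minimal sketch of what the “abbreviated” hourly datasource’s Kafka supervisor spec could look like. The datasource, topic, and server names are placeholders, and the thetaSketch metric assumes the druid-datasketches extension is loaded:

```json
{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "activity_hourly",
      "timestampSpec": { "column": "time", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["course_id", "activity_event"] },
      "metricsSpec": [
        { "type": "count", "name": "events" },
        { "type": "thetaSketch", "name": "user_sketch", "fieldName": "user_id" }
      ],
      "granularitySpec": {
        "segmentGranularity": "day",
        "queryGranularity": "hour",
        "rollup": true
      }
    },
    "ioConfig": {
      "topic": "user-activity",
      "inputFormat": { "type": "json" },
      "consumerProperties": { "bootstrap.servers": "localhost:9092" }
    }
  }
}
```

The “full” datasource would keep user_id and session_id as dimensions and use a finer queryGranularity (e.g. "second"), which effectively minimizes rollup.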

Regarding writing rollups, have you worked through this tutorial?

Another way to experiment without touching your own dataset is to use the wikipedia data. You can load it in by following the quickstart, and, if you do that, you’ll see a specific reference to rollup within the console.

You can turn rollup on and off through the console, and see the effects if you click on Edit spec.
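
Toggling rollup in the console just edits the granularitySpec in the generated ingestion spec; with rollup enabled you should see something along these lines (the exact granularities will depend on what you pick):

```json
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "day",
  "queryGranularity": "hour",
  "rollup": true
}
```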

Hopefully that can get you started, and hopefully others will chime in regarding your questions.