Use of Tuple Sketches

The number one reason I want to use Druid is to avoid storing high-cardinality dimensions (e.g. UserID).

However, there are times when we need to retain some information that depends on these dimensions as well (e.g. an average metric per user).

This is where I stumbled upon DataSketches Tuple sketches. I am wondering if this is the appropriate thing to use for my use case.

An old post in this group also suggested referring to the Tuple Sketch Overview on the DataSketches website, but I could not really confirm whether my assumptions are correct.

As per my understanding, a Tuple Sketch allows us to associate an array of double values with each unique key, sort of extending what information we can store alongside the sketch of the high-cardinality dimension.

This in turn would allow me to compute approximate aggregates for each element in the array (e.g. the mean value of each element, given X unique keys).
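
From the function list in the Druid SQL docs, I imagine the query side would look something like the following on a recent Druid release (with the druid-datasketches extension loaded). The names user_id, metric_value, and your_datasource are just illustrative, and I have not been able to verify this myself:

SELECT
  DS_TUPLE_DOUBLES_METRICS_SUM_ESTIMATE(DS_TUPLE_DOUBLES(user_id, metric_value)) AS metric_sum_estimate,
  APPROX_COUNT_DISTINCT_DS_THETA(user_id) AS unique_users
FROM your_datasource

The mean per element would then presumably be metric_sum_estimate / unique_users.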

My question now is whether I can do computations between the elements of the array. See the example below.

Example:

I have a dataset containing Date, User ID, Total Sales, and Total Transactions. My objective is to retain some information on a per-User ID basis.

Thus:

  • Date will serve as the __time dimension
  • User ID will be the fieldName of the Tuple Sketch
  • Total Sales and Total Transactions would serve as the metricColumns of the Tuple Sketch

I am assuming that with this setup, I can compute the average of the Average Transaction Value (ATV) per user as follows:

atv_per_user = SUM(Total Sales) / SUM(Total Transactions) PER USER
output = AVG(atv_per_user)

I just can’t piece together how this works once a Tuple Sketch is involved. A Druid SQL query that would allow me to do this would be greatly appreciated.
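
For reference, over raw, un-rolled-up rows the computation I am after would look roughly like the standard SQL below (table and column names are just illustrative); the part I cannot work out is how to express this once User ID only exists inside a Tuple Sketch:

SELECT AVG(atv_per_user) AS avg_atv
FROM (
  SELECT user_id,
         SUM(total_sales) * 1.0 / SUM(total_transactions) AS atv_per_user  -- * 1.0 avoids integer division
  FROM raw_sales
  GROUP BY user_id
) AS per_user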

Thanks!

Hi Von,

Based upon your example, a Theta Sketch would be more appropriate for approximating the users. It would enable you to calculate your sales per user for any granularity period (all, day, hour, etc.) and also to perform set-based operations on them, such as comparing sales of different types of users. The Druid sketch documentation is very thorough and well worth the time and effort to fully understand how they work. It has some very good examples, including defining a metric specification for ingest as well as query examples using aggregations/postAggregations with native queries. When sketches are used in conjunction with rollups, they’re a powerful combo. Lastly, I prefer the native JSON queries since they give you more control over how you summarize your data. Hope this helped.
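
P.S. As a rough example of the set-based operations: on a Druid version newer than ours (one that ships the theta sketch SQL functions), comparing two groups of users would look something like the query below. user_type is just an illustrative dimension, so adjust it to whatever you use to classify users:

SELECT
  THETA_SKETCH_ESTIMATE(
    THETA_SKETCH_INTERSECT(
      DS_THETA(user_id) FILTER (WHERE user_type = 'premium'),
      DS_THETA(user_id) FILTER (WHERE user_type = 'returning')
    )
  ) AS premium_and_returning_users
FROM your_datasource
WHERE __time >= TIMESTAMP '2020-05-13 00:00:00'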

Hi J B

The Druid sketch documentation is very thorough and well worth the time and effort to fully understand how they work.

I have already read through that, but I still can’t connect how it should all work in SQL. I have a limitation that it should be done in SQL since we are using Superset.

I do know that there are sketch-related functions listed in the Druid SQL documentation, but some SQL code based on my example would be really helpful.

Apologies if I am not as adept as you are.

Now that you’ve mentioned Theta Sketches: I actually already have my User ID in a Theta Sketch. Does that mean I can invoke set operations and it would somehow determine sales per user? How does that work?

Does Druid take note of how the metric dimensions are distributed per user when it does a rollup?

Thanks for the reply.

We currently use Druid version 0.12.1, so if you’re using a more recent version, there are more powerful functions available for sketches, such as DS_THETA. We used to use Superset as well, but we ended up switching to Turnilo, which is an open-source alternative to Imply Pivot. It handles sketches without the need for SQL and is extremely powerful for slicing/dicing.

At any rate, for Druid version 0.12.1, here are some examples to get you started:

‘all’ granularity:

SELECT APPROX_COUNT_DISTINCT(user_id) AS user_id FROM your_datasource WHERE __time >= TIMESTAMP '2020-05-13 09:00:00'

‘hour’ granularity:

SELECT FLOOR(__time TO HOUR) AS series_hour, APPROX_COUNT_DISTINCT(user_id) AS user_id FROM your_datasource WHERE __time >= TIMESTAMP '2020-05-13 09:00:00' GROUP BY FLOOR(__time TO HOUR)

For Druid versions newer than 0.12.1, you can use the theta-sketch-specific functions (such as DS_THETA) in place of APPROX_COUNT_DISTINCT.
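
Something along these lines should work (I haven’t verified exactly which release these functions landed in); DS_THETA returns a sketch object, so wrap it in THETA_SKETCH_ESTIMATE to get a number, or use APPROX_COUNT_DISTINCT_DS_THETA(user_id) as a one-step alternative:

SELECT THETA_SKETCH_ESTIMATE(DS_THETA(user_id)) AS unique_users FROM your_datasource WHERE __time >= TIMESTAMP '2020-05-13 09:00:00'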

In terms of rollup at ingest, users are sketched according to your queryGranularity. Let's assume your ingestion queryGranularity is 'hour' and your data had this distribution for one day:

Hour 0: Users A, B, and C (theta sketch count would be 3)

Hour 1: Users B and D (theta sketch count would be 2)

Hours 2 - 23: User C (theta sketch count would be 1 for each of these hours)

However, if you have a timeseries query with a granularity ‘day’, the count would be 4. So, the granularity (‘all’, ‘hour’, ‘day’, etc.) will define how the sketches are merged to produce the approximate unique users for that granularity.
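
To illustrate how that merge looks in a query on a newer Druid (assuming the datasketches extension is loaded and user_id is either a raw column or an ingested theta sketch column), a day-bucketed version would be roughly:

SELECT FLOOR(__time TO DAY) AS series_day, APPROX_COUNT_DISTINCT_DS_THETA(user_id) AS unique_users FROM your_datasource GROUP BY FLOOR(__time TO DAY)

For the distribution above, the day bucket would come back as approximately 4, even though the stored hourly sketches count 3, 2, and 1.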

Hope that helped.

Hi J B,

Thank you for your examples, but my use case goes beyond what you presented. I understand that a Theta Sketch can do approximate distinct counts; in fact, I am already using it.

What I need is to somehow associate some metric values per unique key in a sketch.

Based on how I understand Tuple Sketches, it seems they can store an array of double values ALONGSIDE the unique keys being converted into a sketch.

Why do I need to store that information? Our business has metrics that are on a per-user basis, and I am wondering if these could be derived from a sketch.

I will use your example for more clarity:

What if, instead of just counting unique users over a given time period, you also want to compute the AVERAGE DURATION PER USER over that time period?

Your example won’t cut it, because if I store my data in Druid like this:

__time --> time dimension

user_id --> Theta Sketch

duration --> doubleSum metric

The distribution of how much duration each user spent will be lost; it would instead just be summed up at the lowest granularity of the __time column.

As a consequence, you can’t answer questions such as, what is the quantile distribution of average duration spent per user? This can’t be just a simple case of SUM(duration) / DS_THETA(user_id) over a given period of __time.
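
To make that concrete: with a theta sketch for user_id and a doubleSum for duration, the closest thing you can compute is the overall ratio, which is a single number per time bucket rather than a distribution across users. On a newer Druid that would look roughly like:

SELECT SUM(duration) * 1.0 / APPROX_COUNT_DISTINCT_DS_THETA(user_id) AS avg_duration_per_user FROM your_datasource WHERE __time >= TIMESTAMP '2020-05-13 00:00:00'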

Now, I know that Tuple Sketches can do this. My question is whether it’s possible to do some computation between the stored array values.

Hope this clarifies what I wanted to do.

you can’t answer questions such as, what is the quantile distribution of average duration spent per user?

You can use DS_GET_QUANTILES(duration, 0.5, 0.9) to get the median and 90th-percentile durations across all users.
If you have categories of users such as group1 and group2, then you can get the quantiles for those groups.

However, if you want to get data for a specific user (e.g. user 123), you may have to store user_id as a dimension.
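
For reference, the quantiles approach would look roughly like the query below (assuming the druid-datasketches extension is loaded and duration is stored either raw or as a quantiles sketch column). The quantiles are computed over the stored duration values as rolled up, so how close that is to per-user behaviour depends on your rollup granularity:

SELECT DS_GET_QUANTILES(DS_QUANTILES_SKETCH(duration), 0.5, 0.9) AS duration_p50_p90 FROM your_datasource WHERE __time >= TIMESTAMP '2020-05-13 00:00:00'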

I have a limitation that it should be done in SQL since we are using Superset.

You can enable native Druid queries in Superset if you need more flexibility with post-aggregations:

https://superset.incubator.apache.org/installation.html#druid