Using theta sketch for counting ad impressions and unique users


We’re currently serving 250 billion ad impressions per day across our 6 data centers, of which about 180 billion are served in the US alone.
Each ad impression can have hundreds of attributes (dimensions), e.g. Country, City, Browser, OS, Custom-Parameters from the web page, ad-size, ad-id, site-id, etc.

Currently, we don’t have a data warehouse, and ad-hoc OLAP support is pretty much non-existent in our organization. This severely limits our ability to run ad-hoc queries and get a quick grasp of the data.

We want to answer the following 2 queries to begin with:

Q1) Find the total count of ad impressions served from "beginDate" to "endDate" where Dimension1 = d1 and Dimension2 = d2 ... Dimensionk = d_k

Q2) Find the total count of unique users who saw our ads from "beginDate" to "endDate" where Dimension1 = d1 and/or Dimension2 = d2 ... Dimensionk = d_k

As I said, each impression can have hundreds of dimensions (listed above), and the cardinality of each dimension could range from a few hundred (say, for the Country dimension) to billions (e.g. User-id).

We want approximate answers, the least infrastructure cost, and query response times under 5 minutes. I am thinking about using Druid and Apache DataSketches (Theta Sketch, to be precise) for answering Q2, with the following data model:

Date         Dimension Name   Dimension Value   Unique-User-ID (Theta sketch)
2021/09/12   "Country"        "US"              37873-3udif-83748-2973483
2021/09/12   "Browser"        "Chrome"          37873-3aeuf-83748-2973483
<Other records>

So after roll-up, I would end up with one theta sketch per dimension value per day (assuming day-level granularity), and I can do unions and intersections on these sketches to answer Q2.
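For intuition about what those unions and intersections do, here is a toy, pure-Python KMV-style sketch. This is not the Apache DataSketches implementation (which you should use in production) — it is only a minimal illustration, under simplified assumptions, of how a theta sketch estimates distinct counts and how set operations on retained hashes work:

```python
import hashlib

def h(x):
    """Map an item to a uniform float in (0, 1) via SHA-256."""
    d = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

class ToyThetaSketch:
    """Keep the k smallest hashes; theta is the inclusion threshold."""
    def __init__(self, k=4096):
        self.k = k
        self.theta = 1.0
        self.hashes = set()

    def update(self, item):
        v = h(item)
        if v < self.theta:
            self.hashes.add(v)
            if len(self.hashes) > self.k:
                # drop the largest retained hash and lower theta to it
                self.theta = max(self.hashes)
                self.hashes.discard(self.theta)

    def estimate(self):
        if self.theta == 1.0:          # exact mode: saw fewer than k uniques
            return float(len(self.hashes))
        return len(self.hashes) / self.theta

def union(a, b):
    out = ToyThetaSketch(a.k)
    out.theta = min(a.theta, b.theta)
    out.hashes = {v for v in a.hashes | b.hashes if v < out.theta}
    while len(out.hashes) > out.k:     # trim back down to k entries
        out.theta = max(out.hashes)
        out.hashes.discard(out.theta)
    return out

def intersect_estimate(a, b):
    theta = min(a.theta, b.theta)
    common = {v for v in a.hashes & b.hashes if v < theta}
    return len(common) / theta

# e.g. "US" users 0..99999, "Chrome" users 50000..149999
us = ToyThetaSketch()
chrome = ToyThetaSketch()
for i in range(100_000):
    us.update(f"user-{i}")
for i in range(50_000, 150_000):
    chrome.update(f"user-{i}")
print(round(union(us, chrome).estimate()))      # close to 150000
print(round(intersect_estimate(us, chrome)))    # close to 50000
```

This mirrors your data model: one sketch per (date, dimension value) row, then union across days and intersect across dimension filters at query time.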

I am planning to set k (nominal entries) to 10^5. (Please comment on what k would be suitable for this use case, and how much storage it would require.)
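As a rough sizing aid (my own back-of-the-envelope, not official Druid guidance): the relative standard error of a theta sketch is roughly 1/sqrt(k), and each retained entry is a 64-bit hash, so error and maximum sketch size trade off directly. Note that the nominal-entries setting is typically a power of two, so 10^5 would effectively become 2^17 = 131072 — worth verifying against your Druid version:

```python
import math

def theta_sketch_stats(k, bytes_per_entry=8):
    """Back-of-the-envelope error and storage for nominal entries k."""
    rse = 1.0 / math.sqrt(k)  # relative standard error, roughly 1/sqrt(k)
    return {
        "rel_err_3sigma": 3 * rse,          # ~99.7% bound
        "max_bytes": k * bytes_per_entry,   # one 64-bit hash per retained entry
    }

for k in (2**12, 2**14, 2**16, 2**17):
    s = theta_sketch_stats(k)
    print(f"k={k:>6}: ±{s['rel_err_3sigma']*100:.2f}% at 3-sigma, "
          f"up to {s['max_bytes'] // 1024} KiB per sketch")
```

So at k = 2^17 each sketch can reach ~1 MiB; with one sketch per dimension value per day across hundreds of dimensions, total storage is dominated by the number of high-cardinality dimension values you keep, not by k alone.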

I’ve also read about theta sketch set-operation accuracy here.

I would like to know if there is a better approach for solving Q2 (with or without Druid).

Also, I would like to know how I can solve Q1.

If I replace Unique-User-Id with “Impression-Id”, can I use the same data model to answer Q1? I believe the accuracy of counting total impressions would then be much worse than for Q2, because each ad impression is assigned a unique id and we are currently serving 250 billion per day.

Please share your thoughts about solving Q1 and Q2.


KARTIK! This is pretty much why Druid was created in the first place … way back when the Spice Girls were fashionable.

Yes – use DataSketches. And as for changing the k value – my frank answer is: don’t bother. The defaults in Druid were selected to balance storage size against accuracy. Change them and the sketches get either too big or too inaccurate.

If you only need to do COUNT DISTINCT, use HyperLogLog. If you need to do set intersections, use Theta Sketches. You may even want to set up both as metrics.

Re: funnel analysis, by the way: you have Theta Sketches for set operations, but to use them for ordered funnels (A → B → C) you will of course need to be sure that your incoming events arrive in that implicit order. Option 2: have dimensions for columns A, B, and C holding a 1 or a 0, and then just SUM them to find out who got to which stage – that’s very common for counting, say, impressions versus clicks.
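That second option is just additive counting over flag columns. A minimal sketch of the idea (hypothetical stage names and rows, not your actual schema):

```python
# Hypothetical ingested rows: one row per event, with 0/1 flags per funnel stage.
rows = [
    {"user": "u1", "impression": 1, "click": 1, "conversion": 0},
    {"user": "u2", "impression": 1, "click": 0, "conversion": 0},
    {"user": "u3", "impression": 1, "click": 1, "conversion": 1},
]

# SUM each flag column to get the count of events reaching each stage,
# the same way a SUM aggregator over a 0/1 metric works at query time.
funnel = {stage: sum(r[stage] for r in rows)
          for stage in ("impression", "click", "conversion")}
print(funnel)  # {'impression': 3, 'click': 2, 'conversion': 1}
```

Because these flags are plain additive metrics, they roll up exactly (no approximation), which is also why this pattern answers Q1-style total-impression counts without any sketch at all.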

Hope that’s useful…

The accuracy of the sketch for a given k will depend on the number of uniques in your data. I would use THETA_SKETCH_ESTIMATE_WITH_ERROR_BOUNDS in Druid SQL to get the 3-stddev error bounds and make sure they are acceptable.
