We’re currently serving 250 Billion ad impressions per day across our 6 data centers. Out of these, we are serving about 180 Billion ad impressions in the US alone.
Each ad impression can have hundreds of attributes(dimensions) e.g Country, City, Brower, OS, Custom-Parameters from web-page, ad-size, ad-id, site-id etc
Currently, we don’t have a data warehouse and ad-hoc OLAP support is pretty much non-existent in our organization. This severely limits our ability to run adhoc queries and get a quick grasp about data.
We want to answer the following 2 queries to begin with :-
Q1) Find the total count of ad impressions which were served from "beginDate" to "endDate" where Dimension1 = d1 and Dimension2 = d2 .... .. Dimensionk = d_k Q2) Find the total count of unique users which saw our ads from "beginDate" to "endDate" where Dimension1 = d1 and/or Dimension2 = d2 .... .. Dimensionk = d_k
As I said each impression can have hundreds of dimensions(listed above) and cardinality of each dimension could be from few hundreds(say for dimension Country) to Billions(for e.g User-id).
We want approximate answers and the least infrastructure cost and query response time within < 5 minutes. I am thinking about using Druid and Apache datasketches(Theta Sketch to be precise) for answering Q2 and using the following data-model :-
Date Dimension Name Dimension Value Unique-User-ID(Theta sketch) 2021/09/12 "Country" "US" 37873-3udif-83748-2973483 2021/09/12 "Browser" "Chrome" 37873-3aeuf-83748-2973483 . . <Other records>
So after roll-up, I would end up with 1 theta-sketch per dimension value per day(assuming day level granularity) and I can do unions and intersections on these sketches to answer Q2)
I am planning to set k(nominal entries) to 10^5(please comment about what would be suitable k for this use case and expected storage amount required?)
I’ve also read that the about theta sketch set ops accuracy here
I would like to know if there is a better approach to solve Q2(with or without Druid)
Also I would like to know how can I solve Q1?
If I replace Unique-User-Id with “Impression-Id”, can I use the same data model to answer Q1? I believe that if I replace Unique-User-Id with “Impression-Id” then accuracy to count the total impressions would be way worse than that of Q2, because each ad-impression is assigned a unique id and we are currently serving 250 Billion per day.
Please share your thoughts about solving Q1 and Q2.