Let’s say I have two columns in Druid: Campaign and UserId (HLL aggregator). Is it possible to estimate the number of unique users who have seen both Campaign 42 AND Campaign 47? I suppose it’s not possible, because this type of query would need an HLL intersection and there are no good algorithms for that, right?

Or is there some trick for computing it?

Lukáš Havrlant

|A intersect B| = |A| + |B| - |A union B|

So you could estimate it with three queries: (Campaign 42) + (Campaign 47) - (Campaign 42 OR Campaign 47)

But if the result is much smaller than either individual count, you’re going to be significantly off: the error in each HLL estimate scales with the size of that estimate, not with the size of the intersection.
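
To make the error problem concrete, here is a minimal Python sketch of the inclusion-exclusion estimate. It is not Druid code: the function name, the cardinalities, and the 1% error figure are all made up for illustration, and the three inputs stand in for whatever the three queries return.

```python
def estimate_intersection(count_a: float, count_b: float, count_union: float) -> float:
    """Inclusion-exclusion: |A intersect B| = |A| + |B| - |A union B|."""
    # Clamp at zero, because noisy inputs can push the raw difference negative.
    return max(0.0, count_a + count_b - count_union)


# Hypothetical exact cardinalities: two large campaigns with a tiny overlap.
true_a, true_b, true_both = 1_000_000, 1_000_000, 1_000
true_union = true_a + true_b - true_both

# With exact counts the formula recovers the overlap.
exact = estimate_intersection(true_a, true_b, true_union)

# Assume (hypothetically) each count is off by only 1%:
# both campaigns overestimated, the union underestimated.
noisy = estimate_intersection(true_a * 1.01, true_b * 1.01, true_union * 0.99)

print(f"true overlap: {true_both}, exact inputs: {exact:.0f}, noisy inputs: {noisy:.0f}")
```

With these made-up numbers, a 1% error on each input turns a true overlap of 1,000 into an estimate of roughly 41,000, because the errors are proportional to the million-sized counts rather than to the overlap.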

I’m really curious now whether it would be worth adding the ability to do a dimensional cross product for groupBy. If, for example, you could take the values of the campaign dimension (filtered to 42 or 47) and do a cross-product groupBy with a Cardinality aggregator (assuming you store the original UserID as a dimension), I think you could get the answer. No such query exists currently, though.
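
As a rough illustration of what such a query would have to compute, here is a plain Python sketch over raw (campaign, user) rows. It is not a Druid query and does not use any Druid API; the function name and the sample rows are hypothetical, and it assumes the raw UserID is available as a dimension.

```python
from collections import defaultdict


def users_in_all_campaigns(rows, campaigns):
    """Count users who appear in every one of the given campaigns."""
    wanted = set(campaigns)
    seen = defaultdict(set)  # user_id -> set of wanted campaigns seen
    for campaign, user_id in rows:
        if campaign in wanted:  # same role as the "42 or 47" filter
            seen[user_id].add(campaign)
    # A user is in the intersection only if they saw all wanted campaigns.
    return sum(1 for seen_campaigns in seen.values() if seen_campaigns == wanted)


# Hypothetical raw events: (campaign, user_id)
rows = [
    (42, "u1"), (47, "u1"),
    (42, "u2"),
    (47, "u3"), (42, "u3"),
    (13, "u4"),
    (42, "u1"),  # repeated impressions do not change the result
]

both = users_in_all_campaigns(rows, [42, 47])
print(both)
```

This gives an exact count, but it needs the individual user IDs, which is precisely the information an HLL column no longer holds.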