Cumulative union for hyperUnique metric in Druid 0.9.0

I’m using the hyperUnique aggregator to count the number of unique users. Beyond that, I also want to be able to query for the cumulative number of unique users over the chosen time interval and granularity. Is there a way to achieve that in Druid 0.9.0?

Jesal, what do you mean by cumulative number?

By cumulative, I mean a running union of the hyperUnique metric of unique user IDs. For example, if I’m running the query for the time interval April 1-5 with a granularity of 1 day, then:

April 1: cumulative number of unique user IDs = number of unique user IDs

April 2: cumulative number of unique user IDs = number of (unique user IDs on April 1) set union (unique user IDs on April 2)

April 3: cumulative number of unique user IDs = number of (unique user IDs on April 1) set union (unique user IDs on April 2) set union (unique user IDs on April 3)

… and so on.
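
To make the semantics concrete, here is a minimal Python sketch of the running union I’m describing, with plain sets standing in for the HLL sketches Druid uses internally (the per-day IDs are made up for illustration):

```python
# Hypothetical per-day sets of unique user IDs, for illustration only.
daily_users = {
    "2016-04-01": {"a", "b", "c"},
    "2016-04-02": {"b", "d"},
    "2016-04-03": {"c", "e"},
    "2016-04-04": {"a", "f"},
    "2016-04-05": {"g"},
}

seen = set()  # running union across days
for day in sorted(daily_users):
    seen |= daily_users[day]
    print(day, "daily uniques:", len(daily_users[day]),
          "cumulative uniques:", len(seen))
```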

I was wondering if there’s a way to return the unique users as well as the cumulative unique users using a single query.

Hi Jesal, Druid doesn’t currently support this type of query. As a workaround, you can simply fire off multiple queries in succession with increasing intervals. With caching enabled it probably won’t be too slow.
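
A rough sketch of that workaround in Python, assuming a broker at localhost:8082, a datasource named "events", and a hyperUnique metric column named "unique_users" (all placeholder names):

```python
import requests

BROKER_URL = "http://localhost:8082/druid/v2/"  # assumed broker address

# One query per growing interval; granularity "all" collapses each
# interval into a single cumulative value. Interval ends are exclusive.
intervals = [
    "2016-04-01/2016-04-02",  # through April 1
    "2016-04-01/2016-04-03",  # through April 2
    "2016-04-01/2016-04-04",  # through April 3
    "2016-04-01/2016-04-05",  # through April 4
    "2016-04-01/2016-04-06",  # through April 5
]

for interval in intervals:
    query = {
        "queryType": "timeseries",
        "dataSource": "events",            # placeholder datasource
        "granularity": "all",
        "intervals": [interval],
        "aggregations": [
            {"type": "hyperUnique", "name": "uniques",
             "fieldName": "unique_users"}  # placeholder metric
        ],
    }
    resp = requests.post(BROKER_URL, json=query)
    print(interval, resp.json())
```

Since the intervals overlap heavily, per-segment caching means most of the work for later queries is served from cache, which is why this tends not to be too slow in practice.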

Supporting cumulative timeseries queries is a request I’ve heard before. If there isn’t an issue filed for it yet, there probably should be.

@Xavier Thanks for clearing that up.

@charles.allen I agree. It needs a more elegant approach, with support for hyperUnique and thetaSketch aggregations. From what I understand, firing multiple queries sort of defeats the purpose of exploiting Druid’s performance benefits.

Hi Jesal,

Firing multiple queries is often the recommended approach in Druid, as each query is designed to complete relatively quickly.

For example, https://github.com/implydata/pivot fires multiple queries for many of its visualizations.

I cannot fetch the raw thetaSketch (or HyperLogLog) data structure from a timeseries query to do set operations on the client side; I only get the estimated number.

Could you give me some help?

If you pass “finalize”: false in the context of the query, then you should be able to get the raw data structure out.
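
A minimal sketch of what that looks like, reusing the placeholder names from the earlier example; note that finalize goes in the query’s context object:

```python
import requests

query = {
    "queryType": "timeseries",
    "dataSource": "events",                # placeholder datasource
    "granularity": "day",
    "intervals": ["2016-04-01/2016-04-06"],
    "aggregations": [
        {"type": "hyperUnique", "name": "uniques",
         "fieldName": "unique_users"}      # placeholder metric
    ],
    # With finalize disabled, the aggregator is not collapsed into an
    # estimate, so each result row should carry the serialized sketch
    # (a base64 string in the JSON response) instead of a number.
    "context": {"finalize": False},
}

resp = requests.post("http://localhost:8082/druid/v2/", json=query)
for row in resp.json():
    print(row["timestamp"], row["result"]["uniques"])
```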

Hi Fangjin,

I agree that query runtimes in Druid are quite short, so as long as I don’t need to accumulate over a wide time range or at a fine granularity, it wouldn’t really be a problem to combine the results on my end. Thanks for your input!

Thank you very much, I’ll try it.

Did that work for you?

It’s interesting to note that the Pivot UI calculates the “Overall” row correctly for hyperUniques when the “Show totals” checkbox is set.
So Pivot does know how to do the merges; it just does not do them in a way that forms a cumulative sum.

In the case below, the sum of uniques for 4/18 up to 4/21 does not add up to 579k (the overall total), which is the expected behavior.

I am not well versed in Druid internals, but I suppose that while combining the hyperUnique columns, if we maintain a count of the “delta” uniques, we can achieve:

i) Incremental uniques by time granularity and

ii) Cumulative sum of uniques by time granularity.

So, when combining the uniques for time granularities A and B, if we maintain the uniques delta between A and B in an in-memory attribute, we can achieve the above.

Any pointers on the code implementation?
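
Not a pointer into the actual Druid code, but here is a minimal Python sketch of the delta idea, with plain sets standing in for the HLL collectors; the “delta” for each bucket is the growth in cumulative cardinality when that bucket’s sketch is folded in:

```python
# Placeholder per-bucket unique user IDs (a real implementation would
# fold HyperLogLog collectors instead of exact sets).
bucket_users = {
    "2016-04-18": {"a", "b"},
    "2016-04-19": {"b", "c", "d"},
    "2016-04-20": {"d", "e"},
    "2016-04-21": {"a", "f"},
}

running = set()
for bucket in sorted(bucket_users):
    before = len(running)
    running |= bucket_users[bucket]   # fold this bucket's "sketch" in
    delta = len(running) - before     # i) incremental uniques for the bucket
    print(bucket,
          "incremental uniques:", delta,
          "cumulative uniques:", len(running))  # ii) cumulative sum
```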