Does anyone know a way to do a cumulative distribution function in Druid? I am trying to find out:

How many new users were added to our system on a daily basis

How many total users we have in the system by day.
Hi Pratik,
For Cumulative Distribution Function, please take a look at:
http://druid.io/docs/latest/development/extensions-core/datasketches-quantiles.html
Eyal.
Thanks for the link Eyal. I had looked into it earlier. Cumulative distribution does not seem to be available there. I also installed Apache Superset and tried to use the cumsum function they have, but it does not push down the uniques calculation for the time slices to Druid, so it calculates the cumsum incorrectly for HLL columns.
~Pratik
I tried using the following function in a custom measure in the Pivot UI, but I get an error: “Should not call getJS on External”
$main.sum($main.countDistinct($my_uniques_metric))
It’s interesting to note that the Pivot UI calculates the “Overall” row correctly for hyperUniques when the “Show totals” checkbox is set.
So Pivot does know how to do the merges; it just does not do them in a way that forms a cumulative sum.
In the case below, the sum of the daily uniques for 4/18 through 4/21 does not add up to the 579k overall total, which is the expected behavior, since distinct counts are not additive.
I am not sure I understand the question. It seems to me that a few different things are mixed here.
The total number of distinct users and the number of new users have to do with set operations. A cumulative distribution function is an entirely different thing.
For distinct count problem and set operations I would suggest looking at the Theta sketch aggregator.
In particular, one can perform a set difference between today’s users and yesterday’s users to estimate the distinct count of users who appeared today, but not yesterday.
Or, instead of yesterday, one can use a union of the daily sets for the last week or month, depending on the definition of “new user” (a user not seen before, where “before” means the last week or month or some other period; it is hard to look back indefinitely).
I don’t quite see what this has to do with a cumulative distribution function.
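The set operations described above can be illustrated with exact Python sets. This is only a conceptual sketch: Theta sketches compute approximate versions of these same unions and differences, and the dates and user IDs below are made up.

```python
# Exact-set illustration of the sketch operations described above.
# Theta sketches give approximate answers to the same questions;
# plain Python sets compute them exactly on toy data (IDs are made up).

daily_users = {
    "2018-04-18": {"a", "b", "c"},
    "2018-04-19": {"b", "c", "d"},
    "2018-04-20": {"d", "e"},
    "2018-04-21": {"a", "e", "f"},
}

results = {}
seen_before = set()                    # union of all prior days (Theta: UNION)
for day in sorted(daily_users):
    today = daily_users[day]
    new_users = today - seen_before    # Theta: set difference (A-NOT-B)
    seen_before |= today               # running union = cumulative distinct users
    results[day] = (len(new_users), len(seen_before))

for day, (new, total) in results.items():
    print(day, "new users:", new, "cumulative users:", total)
```

The running union is exactly the “total users by day” number from the original question, and the per-day difference is the “new users added daily” number.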
Thanks for the details and the pointer to theta sketches post aggregation.
~Pratik
Can I use the thetaSketch aggregator even if my input column was of type hyperUnique at ingest time?
Thanks,
~Pratik
No, these are entirely different data structures.
Oh okay, then I guess I need to set them as theta sketches at ingest time?
Or would it be sufficient to set “isInputThetaSketch”: false while querying?
Thanks,
~Pratik
I am not sure I am following.
If your metric column in a segment contains HLL sketches, you cannot just reinterpret it as Theta sketches. They are completely different algorithms.
Another thing is how does one create such metric columns. Yes, one way would be to do it at ingest time by converting user IDs into Theta or HLL sketches.
If you created Theta sketches, then at query time you can do unions, intersections and set differences. If you have HLL sketches, you are stuck with unions only (simply speaking).
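As a sketch of the query-time side, a Druid query using the datasketches extension can combine filtered thetaSketch aggregators with a NOT set operation to estimate “users seen today but not yesterday”. The data source, column, and interval names below are hypothetical:

```json
{
  "queryType": "timeseries",
  "dataSource": "events",
  "granularity": "all",
  "intervals": ["2018-04-20/2018-04-22"],
  "aggregations": [
    { "type": "filtered",
      "filter": { "type": "interval", "dimension": "__time",
                  "intervals": ["2018-04-21/2018-04-22"] },
      "aggregator": { "type": "thetaSketch", "name": "today_users",
                      "fieldName": "user_sketch" } },
    { "type": "filtered",
      "filter": { "type": "interval", "dimension": "__time",
                  "intervals": ["2018-04-20/2018-04-21"] },
      "aggregator": { "type": "thetaSketch", "name": "yesterday_users",
                      "fieldName": "user_sketch" } }
  ],
  "postAggregations": [
    { "type": "thetaSketchEstimate", "name": "new_users",
      "field": { "type": "thetaSketchSetOp", "name": "today_not_yesterday",
                 "func": "NOT",
                 "fields": [
                   { "type": "fieldAccess", "fieldName": "today_users" },
                   { "type": "fieldAccess", "fieldName": "yesterday_users" }
                 ] } }
  ]
}
```

Replacing "NOT" with "UNION" over a set of daily aggregators gives the cumulative distinct count instead.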
I mean, does a hyperUnique column (set at ingest time) use theta sketches?
If not, is there a way to specify that a column should use theta sketches at ingest time?
Thanks,
~Pratik
I now understand what you were saying. So basically I need to have the theta sketch column prebuilt in my input data source in Hive and ingest it directly into Druid as a regular column. Then, during aggregation, Druid can make use of the theta sketch column once the plugin is enabled.
~Pratik
No, that is not what I am saying. HyperUnique and ThetaSketch are two completely different aggregators based on incompatible algorithms and data structures.
Both provide approximate distinct counting. Both provide merging (computing set union). Theta sketches also provide intersection and set difference.
You can build Theta sketches outside of Druid (in Hive or Pig) and ingest sketches, or you can build sketches at ingestion time from a column with unique identifiers, or you can build sketches at query time.
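For example, a metricsSpec along these lines (all column names hypothetical) covers the two ingestion-time options: building a sketch from a raw user-ID column, or ingesting sketches prebuilt in Hive or Pig:

```json
"metricsSpec": [
  {
    "type": "thetaSketch",
    "name": "user_sketch",
    "fieldName": "user_id",
    "isInputThetaSketch": false
  },
  {
    "type": "thetaSketch",
    "name": "user_sketch_prebuilt",
    "fieldName": "hive_user_sketch",
    "isInputThetaSketch": true
  }
]
```

In the first entry Druid builds the sketch from raw IDs; in the second, "fieldName" points to a column that already contains serialized Theta sketches produced outside Druid, which is what "isInputThetaSketch": true means.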