Here is the data set that I have:
T   User  C   S   L
t1  a1    c1  s1  l1
t2  a1    c2  s1  l1
t3  a2    c3  s2  l2
t4  a3    c4  s3  l3
t4  a4    c4  s3  l3
t4  a5    c1  s4  l4
Here T is the timestamp and all other columns are dimensions.
My objective is to filter on C, group by S and L, and compute the Probability Mass Function (PMF) of the number of occurrences per user (the User column, values a1, a2, ...).
- E.g., filter on c1
- Group by S and L
- Then, for every combination of S and L: how many users occurred once, how many occurred twice, and so on
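To make the target concrete, here is a minimal sketch in plain Python of the exact (non-approximate) result I am after, run on the sample rows above. It is not Druid code. The function name `occurrence_pmf` is only an illustrative name, and the PMF is expressed as the fraction of users per occurrence count.

```python
from collections import Counter, defaultdict

# Sample rows from the question: (T, User, C, S, L)
ROWS = [
    ("t1", "a1", "c1", "s1", "l1"),
    ("t2", "a1", "c2", "s1", "l1"),
    ("t3", "a2", "c3", "s2", "l2"),
    ("t4", "a3", "c4", "s3", "l3"),
    ("t4", "a4", "c4", "s3", "l3"),
    ("t4", "a5", "c1", "s4", "l4"),
]


def occurrence_pmf(rows, c_values):
    """Filter on C, group by (S, L), and return the PMF of per-user row counts."""
    # (S, L) -> Counter mapping each user to its number of rows
    per_group = defaultdict(Counter)
    for _t, user, c, s, l in rows:
        if c in c_values:
            per_group[(s, l)][user] += 1

    result = {}
    for group, user_counts in per_group.items():
        # How many users occurred once, twice, ...
        histogram = Counter(user_counts.values())
        total_users = len(user_counts)
        # Normalise the counts into fractions of users
        result[group] = {k: n / total_users for k, n in sorted(histogram.items())}
    return result


if __name__ == "__main__":
    print(occurrence_pmf(ROWS, {"c1"}))
```

With the filter widened to c1 and c2, user a1 falls into the "occurred twice" bucket for (s1, l1), which is the behaviour I need the sketches to reproduce approximately.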
Theoretically I can achieve this with ArrayOfDoublesSketch. At pre-computation time, I would build one sketch per combination of C, S and L by updating ArrayOfDoublesSketch(User, Array(1.0)) for every user in that combination, and maintain a map from each combination to its sketch. At query time, given a time range and a filter on C, I would union the sketches of each S and L combination across the matching C values, and then compute the PMF from the resulting sketch.
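The plan above can be sketched as follows, using an exact stand-in for the sketch: a plain dict from user to summed value takes the place of ArrayOfDoublesSketch. This assumes that updating or unioning sums the values of a user that appears more than once, and it leaves out the time-range restriction. All function names here are hypothetical, chosen for illustration only.

```python
from collections import Counter, defaultdict

# Sample rows from the question: (T, User, C, S, L)
ROWS = [
    ("t1", "a1", "c1", "s1", "l1"),
    ("t2", "a1", "c2", "s1", "l1"),
    ("t3", "a2", "c3", "s2", "l2"),
    ("t4", "a3", "c4", "s3", "l3"),
    ("t4", "a4", "c4", "s3", "l3"),
    ("t4", "a5", "c1", "s4", "l4"),
]


def build_sketches(rows):
    """Pre-computation: one user -> summed value map per (C, S, L) combination."""
    sketches = defaultdict(lambda: defaultdict(float))
    for _t, user, c, s, l in rows:
        # Stand-in for updating a sketch with (user, [1.0])
        sketches[(c, s, l)][user] += 1.0
    return sketches


def union_over_c(sketches, c_values):
    """Query time: merge the maps of each (S, L) across the selected C values."""
    merged = defaultdict(lambda: defaultdict(float))
    for (c, s, l), users in sketches.items():
        if c in c_values:
            for user, value in users.items():
                # Assumption: the union sums values of matching users
                merged[(s, l)][user] += value
    return merged


def pmf(user_values):
    """Fraction of users per summed value (1.0 = seen once, 2.0 = seen twice, ...)."""
    counts = Counter(user_values.values())
    total = len(user_values)
    return {value: n / total for value, n in sorted(counts.items())}


if __name__ == "__main__":
    merged = union_over_c(build_sketches(ROWS), {"c1", "c2"})
    for group, users in merged.items():
        print(group, pmf(users))
```

A real sketch would only retain a sample of users, so the PMF would be an estimate rather than the exact fractions this stand-in produces.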
How can I achieve this in Druid? Which aggregators should I use when ingesting the data? And which aggregators and postAggregators should I use at query time?
I tried the Tuple sketch (arrayOfDoublesSketch) as the aggregator.
Aggregator config:
Here metricColumn is a double column that I added to the data with the value 1.0 for every row, because creating a sketch from raw data requires a metric column whose value is stored in the sketch’s double array.
Then, at query time, I filtered on C, grouped by S and L, and tried the arrayOfDoublesSketchToQuantilesSketch and quantilesDoublesSketchToHistogram postAggregators to get the PMF that I need. But the query didn’t return any postAggregation results.
In fact, even a simple arrayOfDoublesSketchToEstimate did not work. There was no query error, just empty aggregation results for each groupBy value.
What am I doing wrong? Is my aggregation at ingestion wrong, or are my postAggregations at query time wrong? Is the aggregation that I want achievable in Druid?
Any help is appreciated. Thanks.