What do you think of this design for randomly sampling an interval for rows?
http://codepen.io/markduane/full/9e5e2d7fff410b9e49d2213621181fe4/
Hi Duane,
I am not sure I am getting the idea. When you say you add sample_i, is it at ingestion time?
If so, I am having trouble understanding how the sample can stay valid beyond a single interval.
For instance, suppose you do your sampling based on a 15-minute interval and suppose we only have 2 intervals.
Assume at interval 1 we have seen only event A, so a 50% sample draws from a population containing only A. But at interval 2 the population is a uniform distribution of A, B, C, D, so asking for 50% of interval 2 will probably return a uniform sample of A, B, C, D.
The crux, IMO, is: how can you compute a 50% sample for interval 1 + interval 2?
Thanks Slim,
So I should state that this would be a “select” query that matches these filters, and each row would have different settings for the sample_i dimensions. If each row contains multiple events then it might be possible to do a variation on this idea and weight the sample_i dimensions by the number of events contained in each row. Does that make sense?
Also, yes this would be at ingest time.
So assuming interval 1 contains 8 total matching rows (all of which are A), and interval 2 contains 8 matching rows (2 of each A-D), then requesting 50% of both intervals would return each of the 16 rows with 1/2 probability, on average some selection of 8 rows.
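A minimal Python sketch of that setup, assuming ingest-time tagging where each successive sample_i dimension halves the sampling rate (that halving scheme and the tag_row helper are illustrative assumptions, not anything Druid provides):

```python
import random

# Hypothetical sketch of ingest-time tagging: each row gets boolean
# sample_i dimensions, where sample_i is true with probability
# 1 / 2**(i + 1). tag_row is an illustrative helper, not a Druid API.
def tag_row(row, levels=3, rng=random):
    for i in range(levels):
        row[f"sample_{i}"] = rng.random() < 0.5 ** (i + 1)
    return row

rng = random.Random(42)
# Interval 1: eight rows, all event A.
interval_1 = [tag_row({"event": "A"}, rng=rng) for _ in range(8)]
# Interval 2: two rows each of A, B, C, D.
interval_2 = [tag_row({"event": e}, rng=rng) for e in "AABBCCDD"]

# Asking for "50% of intervals 1+2" is just a filter on sample_0:
# each of the 16 rows is included independently with probability 1/2.
sampled = [r for r in interval_1 + interval_2 if r["sample_0"]]
print(len(sampled))  # some subset of the 16 rows; 8 on average
```

Because the flags are fixed at ingest, the same filter composes across any set of intervals without re-sampling.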
I am not sure whether this makes sense, but here is my understanding.
Assume my data contains (time_visit, User_ID, User_age), and I would like a representative 50% sample of the ages.
Keeping the same example: we have only users of age 15 logged on at interval 1 (late at night), and users of ages 15, 45, 65, and 95 at interval 2 (working hours). So if you apply the filter, you will see a distribution skewed toward users of age 15.
Does this make sense? Or maybe your use case is different.
Sorry, I’m not sure I follow your question. Let me provide an overly-simplified example just using the sample_0 dimension (50% sample) with alternating true/false
Interval 1
- age: 15, sample_0: false
- age: 15, sample_0: true
- age: 15, sample_0: false
- age: 15, sample_0: true
Interval 2
- age: 15, sample_0: false
- age: 45, sample_0: true
- age: 65, sample_0: false
- age: 95, sample_0: true
Total population: 8
Sampled population: 4
Average age of sample: (15 + 15 + 45 + 95) / 4 = 42.5
Does that align with your understanding, and if so can you elaborate the issue a little more?
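The arithmetic in that example can be checked directly with plain Python (no Druid involved; the rows are exactly the toy data above):

```python
# The eight rows from intervals 1 and 2, with their sample_0 flags.
rows = [
    (15, False), (15, True), (15, False), (15, True),  # interval 1
    (15, False), (45, True), (65, False), (95, True),  # interval 2
]

sample = [age for age, sampled in rows if sampled]
print(len(rows))                  # total population: 8
print(len(sample))                # sampled population: 4
print(sum(sample) / len(sample))  # average age of sample: 42.5
```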
That’s exactly where I am not sure what makes sense from a use-case perspective; I would expect (15 + 45 + 65 + 95) / 4 = 55.
Got it, thanks for clarifying. So you’re interested in the average unique age. In that case you could find the unique ages of the sample and then average.
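That post-processing step, done client-side on the sampled rows, would look something like this (a sketch using the toy sample from above):

```python
# Ages of the sampled rows pulled back from Druid (the 50% sample
# from the worked example above).
sampled_ages = [15, 15, 45, 95]

# Average over the unique ages rather than over rows, so repeated
# ages do not dominate the mean.
unique_ages = set(sampled_ages)
print(sum(unique_ages) / len(unique_ages))  # (15 + 45 + 95) / 3 ≈ 51.67
```

Note the sample happened to miss the age-65 row, so this approximates rather than reproduces the full-population unique-age average of 55.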
The intent is to extract n rows out of N total rows randomly so that those rows can be analyzed outside of druid and inferences made about the population without trying to download all N rows. So if you had 15 million users in an interval, you could sample some much smaller number of those randomly and have basic summary statistics, histograms, down-sampled timeseries data, etc… Does that make sense?
General comments on sampling:
Thanks Charles. We’re trying to complement the Druid aggregations and timeseries data with views that attempt to show the underlying data (rows). Often it’ll be <1% of some interval. This would be helpful for making visualizations and summaries. One of our use cases is displaying parallel coordinates of selected rows from an interval. This downloads, let’s say, 10k rows sequentially, which are then visualized via parallel coordinates. What would be more helpful is if those 10k rows were a random sample of rows from throughout the interval rather than the first 10k. We are mindful that we might want to weight these samples by some value depending on the goal. For example, if a row contained total_num_events, and we didn’t want to bias toward rare events, then we’d weight each of the sample dimensions by total_num_events. Setting that aside, though, and assuming that we know what we want to do with the sampled rows, are there any problems with using this technique to sample random rows from Druid? Do you think queries would be faster with the separate boolean dimensions or with one dimension where multiple values are matched?
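One way the event-count weighting mentioned above might be realized (a sketch under my own assumptions; the thread does not settle the mechanism) is to include a row in the p-sample whenever at least one of its underlying events would have been sampled:

```python
import random

# Sketch: event-weighted row sampling. A row covering n events is
# tagged as sampled with probability 1 - (1 - p) ** n, i.e. the
# chance that at least one of its events would land in a per-event
# p-sample. This is an illustrative assumption, not a Druid feature.
def weighted_sample_flag(total_num_events, p=0.5, rng=random):
    return rng.random() < 1.0 - (1.0 - p) ** total_num_events
```

Under this scheme a row summarizing many events is almost always included, so the sample is weighted by event count rather than treating every row equally.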
In general you’re going to get the best results if the thing you want to filter on is already a unique dimension value.
Druid does have the concept of a multi-value dimension, so you could have a dimension called sample_categories and in there have any set of sample0 … sample32 (or none). OR you could have a bunch of dimensions, each labeled sample0 … sample32, with a value of either T or nothing.
There are differences in the behavior of the two if you are using the dimension values in a RESULT (ex: taking a topN on sample_categories)…
If you’re doing filtering with pretty default ingestion then it shouldn’t matter which one you’re using.
If you’re doing very customized ingestion (like specifying dimension indexing ordering) then it might make a difference; the discrete-dimensions route gives you more control there.
Hopefully that helps
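For concreteness, the two layouts might be filtered with native Druid selector filters along these lines (a sketch; the dimension names come from this thread, and a selector filter on a multi-value dimension matches a row if any of its values equals the target):

```python
# Layout 1: one multi-value dimension (sample_categories) holding
# tag values. A selector filter matches a row when ANY of the row's
# values for that dimension equals the target value.
multi_value_filter = {
    "type": "selector",
    "dimension": "sample_categories",
    "value": "sample0",
}

# Layout 2: many discrete dimensions, each either "T" or absent.
discrete_filter = {
    "type": "selector",
    "dimension": "sample0",
    "value": "T",
}
```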
That’s very helpful, thanks Charles!
Quick question - is random sampling something that could theoretically be built into druid or does it not make sense (because of the problems you’ve mentioned)?
Can you please explain how you are doing this at ingestion time, and especially how you manage the time buckets? That would give me more insight.
Then I guess we have to define exactly what random sampling means here: is it over time, over dimensions, or a combination of both?
Proper handling of statistics-related items is something that would be very interesting to get into druid. Most solutions I know of do not require modifications to core druid itself, but some do (ex: https://github.com/druid-io/druid/pull/2525 https://github.com/druid-io/druid/pull/2090 )