Random Samples from an interval

What do you think of this design for randomly sampling an interval for rows?

http://codepen.io/markduane/full/9e5e2d7fff410b9e49d2213621181fe4/

Hi Duane,
I am not sure I am getting the idea very well. When you say you add sample_i, is that at ingestion time?
If it is at ingestion time, I am having trouble understanding how this can hold outside a given interval.

For instance, suppose you do your sampling based on a 15-minute interval, and suppose we only have 2 intervals.
Assume that at interval 1 we have seen only event A, so a 50% sample comes from a population that contains only A. But at interval 2 we now have a uniform distribution of A, B, C, D as the population, so asking for 50% of interval 2 will probably return a uniform sample of A, B, C, D.

Then the crux, IMO, is how you can compute the 50% sample for interval 1 + interval 2.

Thanks Slim,

So I should state that this would be a “select” query that matches these filters, and each row would have different settings for the sample_i dimensions. If each row contains multiple events then it might be possible to do a variation on this idea and weight the sample_i dimensions by the number of events contained in each row. Does that make sense?

Also, yes this would be at ingest time.

So assuming interval 1 contains 8 total matching rows (all of which are A), and interval 2 contains 8 matching rows (2 of each A-D), then requesting 50% of both intervals would return each of the 16 rows with 1/2 probability, on average some selection of 8 rows.
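To make the ingest-time part concrete, here is a minimal sketch of how the sample_i flags could be assigned as each row is built, assuming nested flags (sample_0 covers roughly 50% of rows, sample_1 roughly 25%, and so on); the class and method names are made up for illustration and the exact scheme in the CodePen may differ:

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.concurrent.ThreadLocalRandom;

    // Hypothetical ingest-time helper (not part of Druid): tags a row with
    // nested sample_i flags, so sample_0 holds ~50% of rows, sample_1 ~25%, etc.
    public final class SampleTagger {
      public static Map<String, String> tagSampleDimensions(int levels) {
        Map<String, String> dims = new LinkedHashMap<>();
        boolean included = true;
        for (int i = 0; i < levels; i++) {
          // a row can only be in sample_i if it is also in sample_(i-1),
          // so each level is a 50% subsample of the previous one
          included = included && ThreadLocalRandom.current().nextBoolean();
          dims.put("sample_" + i, Boolean.toString(included));
        }
        return dims;
      }

      public static void main(String[] args) {
        // e.g. {sample_0=true, sample_1=false, sample_2=false}
        System.out.println(tagSampleDimensions(3));
      }
    }

Because each row is tagged independently of which interval it lands in, filtering on sample_0 = true across interval 1 + interval 2 still returns each matching row with roughly 1/2 probability.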

I am not sure whether this makes sense, but here is my understanding.

Assume that my data contains (time_visit, User_ID, User_age), and assume I would like to know what a representative 50% sample of the ages looks like.

Keeping the same example: only users of age 15 are logged on at interval 1 (late night), and users of age 15/45/65/95 at interval 2 (working hours). So if you apply the filter, you will see a distribution skewed toward users of age 15.

Does this make sense? Or maybe your use case is different.

Sorry, I’m not sure I follow your question. Let me provide an overly-simplified example just using the sample_0 dimension (50% sample) with alternating true/false.

Interval 1

  • age: 15, sample_0: false
  • age: 15, sample_0: true
  • age: 15, sample_0: false
  • age: 15, sample_0: true

Interval 2

  • age: 15, sample_0: false
  • age: 45, sample_0: true
  • age: 65, sample_0: false
  • age: 95, sample_0: true

Total population: 8

Sampled population: 4

Average age of sample: (15 + 15 + 45 + 95) / 4 = 42.5

Does that align with your understanding, and if so can you elaborate the issue a little more?

That’s exactly where I am not sure what makes sense from a use-case perspective; I would expect (15 + 45 + 65 + 95) / 4 = 55.

Got it, thanks for clarifying. So you’re interested in the average unique age. In that case you could find the unique ages of the sample and then average.
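For example, using the sampled rows from the exchange above:

    import java.util.stream.IntStream;

    // Average of the unique ages in the 50% sample above (15, 15, 45, 95):
    // deduplicate first, then average.
    public final class UniqueAgeAverage {
      public static void main(String[] args) {
        double avg = IntStream.of(15, 15, 45, 95)
            .distinct()              // -> 15, 45, 95
            .average()               // OptionalDouble
            .orElse(Double.NaN);
        System.out.println(avg);     // (15 + 45 + 95) / 3 ≈ 51.7
      }
    }

Note that the 65 happened not to be sampled at all, so this approximates rather than reproduces the population's unique-age average of 55; a larger sample gets closer on average.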

The intent is to extract n rows out of N total rows randomly so that those rows can be analyzed outside of Druid and inferences made about the population without trying to download all N rows. So if you had 15 million users in an interval, you could sample some much smaller number of those randomly and have basic summary statistics, histograms, down-sampled timeseries data, etc… Does that make sense?

General comments on sampling:

  • The rollup Druid does really messes with sampling. Depending on the dataset and how much rollup you get, you may end up not having any space or query-speed savings from sampling the data.
  • Sampling is NOT guaranteed to work once the data has made it into Druid. The crux of this is that data is rolled up once it enters Druid. In the very simple case where you simply have two events A and B, if you have a million A events and 1 B event within a specific QueryGranularity interval, then they will get rolled up into one row for event A and one row for event B. This means that sampling on the QUERY side of Druid, with no accounting at ingestion time, artificially favors rare events.
  • If you are doing an approach as per the link, where you are essentially taking some sort of random hash of an event (or a part of an event) and keeping multiple entries for how many bits of the hash you use to generate your key (hash >> (32 - hashPower) == 0 for example with an int hash; see the sketch after this list)… then you are skipping events and reducing the overall AGGREGATION work that needs to be done, but not necessarily the amount of paging-in that has to occur. For example, if you have events each of LONG size (8 bytes), then one on-disk block (4k) can hold about 512 aggregation metrics… UNCOMPRESSED (you are probably going to be using compression, which will make the on-disk block hold even more events). If you hit that block and page it into memory, you’re probably getting 512 events loaded into memory regardless of how many you are using. As such, unless you are using a small sample (0.1% or less), it is not obvious you’ll get much benefit from the point of view of paging stuff into memory. So… maybe helpful (if you are user-CPU bound) but maybe not (if you are io-wait bound).
  • Getting highly accurate results from sampled data can be tricky. You should make sure you chat with someone who is up to date on their statistics theory and practice.
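For the hash-based inclusion test mentioned in the third bullet, a minimal sketch might look like the following. The CRC32 hash, class, and method names are assumptions for illustration; any reasonably uniform hash of a stable part of the event would do, and the unsigned shift (>>>) keeps negative int hashes from being excluded outright.

    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;

    // Keep an event iff the top hashPower bits of its hash are zero,
    // i.e. with probability roughly 1 / 2^hashPower.
    public final class HashSampler {
      public static boolean inSample(String eventKey, int hashPower) {
        CRC32 crc = new CRC32();
        crc.update(eventKey.getBytes(StandardCharsets.UTF_8));
        int hash = (int) crc.getValue();
        return (hash >>> (32 - hashPower)) == 0;
      }

      public static void main(String[] args) {
        // e.g. a roughly 1-in-8 sample (hashPower = 3) over a few user ids
        for (String id : new String[]{"user-1", "user-2", "user-3", "user-4"}) {
          System.out.println(id + " -> " + inSample(id, 3));
        }
      }
    }

Tagging a row with one boolean dimension per hashPower at ingest is one way to get the nested 1/2, 1/4, 1/8… samples discussed earlier.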

Thanks Charles. We’re trying to complement the Druid aggregations and timeseries data with views that attempt to show the underlying data (rows). Often it’ll be <1% of some interval. This would be helpful for making visualizations and summaries. One of our use cases is displaying parallel coordinates of selected rows from an interval. This downloads sequentially, let’s say, 10k rows, which are then visualized via parallel coordinates. What would be more helpful is if those 10k rows were a random sample of rows from throughout the interval rather than the first 10k.

We are mindful that we might want to weight these samples by some value depending on the goal. For example, if some row contained total_num_events, and we didn’t want to bias toward rare events, then we’d weight each of the sample dimensions by total_num_events. Setting that aside, and assuming that we know what we want to do with the sampled rows, are there any problems with using this technique to sample random rows from Druid? Do you think queries would be faster with the separate boolean dimensions or with one dimension where multiple values are matched?
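One possible reading of that weighting idea (an interpretation, not necessarily what the CodePen design intends): a rolled-up row covering k = total_num_events underlying events is kept whenever at least one of those k events would survive an event-level sample of rate p, i.e. with probability 1 - (1 - p)^k, so heavily rolled-up rows are almost never dropped and the row-level bias toward rare events is counteracted. A sketch:

    import java.util.concurrent.ThreadLocalRandom;

    // Event-weighted row sampling: P(keep) = 1 - (1 - p)^totalNumEvents.
    public final class EventWeightedSampler {
      public static boolean keepRow(long totalNumEvents, double p) {
        double keepProbability = 1.0 - Math.pow(1.0 - p, totalNumEvents);
        return ThreadLocalRandom.current().nextDouble() < keepProbability;
      }

      public static void main(String[] args) {
        System.out.println(keepRow(1, 0.5));     // ~50% of single-event rows kept
        System.out.println(keepRow(1000, 0.5));  // many-event rows essentially always kept
      }
    }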

In general you’re going to get the best results if the thing you want to filter on is already a unique dimension value.

Druid does have the concept of a multi-value dimension, so you could have a dimension called sample_categories and in there have any set of sample0… sample32 (or none)

OR you could have a bunch of dimensions, each labeled sample0…sample32, with a value of either T or nothing
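To make the two options concrete, here are two hypothetical shapes for the same ingested row; the field names and values are illustrative, not a fixed schema:

    import java.util.List;
    import java.util.Map;

    public final class SampleRowShapes {
      public static void main(String[] args) {
        // Option 1: one multi-value dimension listing the samples this row is in
        Map<String, Object> multiValue = Map.of(
            "timestamp", "2016-04-01T00:00:00Z",
            "age", 45,
            "sample_categories", List.of("sample0", "sample1")  // or an empty list
        );

        // Option 2: one dimension per sample level, "T" or simply absent
        Map<String, Object> discrete = Map.of(
            "timestamp", "2016-04-01T00:00:00Z",
            "age", 45,
            "sample0", "T",
            "sample1", "T"
            // sample2..sample32 omitted when the row is not in those samples
        );

        System.out.println(multiValue);
        System.out.println(discrete);
      }
    }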

There are differences in behavior of the two if you are using the dimension values in a RESULT (ex: taking topN sample_categories)…

If you’re doing filtering with pretty default ingestion then it shouldn’t matter which one you’re using.

If you’re doing very customized ingestion (like specifying dimension indexing ordering), then it might make a difference; you have more control with the discrete-dimensions route.

Hopefully that helps

That’s very helpful, thanks Charles!

Quick question - is random sampling something that could theoretically be built into druid or does it not make sense (because of the problems you’ve mentioned)?

Can you please explain how you are doing this at ingestion time, especially how you manage the time buckets?
That would give me more insight.

Then I guess we have to define what exactly random sampling means: is it over time, over dimensions, or a combination of both?

Proper handling of statistics-related items is something that would be very interesting to get into Druid. Most solutions I know of do not require modifications to core Druid itself, but some do (ex: https://github.com/druid-io/druid/pull/2525 https://github.com/druid-io/druid/pull/2090 )