95th and 99th percentiles

Let's assume each row has a field called value that is some integer:

```
{
  "value": 42
}
```

Every five minutes we receive a new row. Let's also assume that we have the granularity defined as 1 hour.

Then:

  1. Is there a way to create an aggregator that will calculate the 95th and 99th percentiles of the **value**s?
  2. Assuming that I have a reindex job that takes the same data source as in (1) and reindexes it into a new data source that has granularity set to 1 day, is there a way to have correctly calculated 95th and 99th percentiles on a 1-day scale?

Thank you

Hey Vadim,

For percentiles, check out http://druid.io/docs/latest/development/extensions-core/approximate-histograms.html or http://druid.io/docs/latest/development/extensions-core/datasketches-quantiles.html (the latter is newer and generally better). We are also working on a doc update / blog post with recommendations in this area, so stay tuned!

The quantile sketch aggregators all properly handle case (2) that you brought up.
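For example, the ingest-time metricsSpec entry could look something like this (a minimal sketch, assuming the druid-datasketches extension is loaded; the output name value_sketch is made up):

```
"metricsSpec": [
  { "type": "quantilesDoublesSketch", "name": "value_sketch", "fieldName": "value", "k": 128 }
]
```

When you later reindex into the 1-day data source, applying the same "quantilesDoublesSketch" aggregator to the stored value_sketch column merges the hourly sketches rather than re-reading raw values, which is what keeps the daily percentiles correct.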

Approximate histograms are pretty rough. DataSketches might work better, but I don't have production experience with it. I've used bucketized values before with reasonable success: you define a logarithmic range and do some ETL on the value to make a dimension for the bucket (e.g., dimension values like 1-2, 2-4, 4-8, 8-16), then do some extra processing on the query results. This tends to work well for highly skewed data where the relative magnitude is more important than the actual magnitude.
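As a rough illustration of that ETL step (a standalone Python sketch, not part of Druid; the function name and bucket scheme are just illustrative):

```
import math

def log2_bucket(value):
    """Map a positive number to a power-of-two bucket label, e.g. 9 -> '8-16'."""
    if value < 1:
        return "<1"  # catch-all below the first bucket
    lo = 2 ** int(math.floor(math.log(value, 2)))
    return f"{lo}-{lo * 2}"

# Tag each row with its bucket before ingestion, then ingest value_bucket as a dimension.
row = {"value": 9}
row["value_bucket"] = log2_bucket(row["value"])  # -> "8-16"
```

At query time you would group on the (hypothetical) value_bucket dimension and walk the bucket counts to estimate a percentile.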

If your data is normal-ish, you can also check out the stats module and use the mean and stddev.
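If you go that route, a query could use something like this (a sketch based on the druid-stats extension docs; the names value_var and value_stddev are made up):

```
"aggregations": [
  { "type": "variance", "name": "value_var", "fieldName": "value" }
],
"postAggregations": [
  { "type": "stddev", "name": "value_stddev", "fieldName": "value_var" }
]
```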

Thank you, guys, for the quick response!

I have one more question then.

Is it possible to apply the 95th percentile calculation not to each and every row, but only to rows that have a dimension equal to a defined value?

For example, let's assume that I collect wind speed and temperature, and each row contains a marker indicating which metric it is:

```
{
  "type": "wind speed",
  "value": 16
}
```

And I only want the 95th percentile to be calculated (on rollup) for rows where "type" equals "wind speed".

Since I have a lot of types that do not need a percentile calculated, I assume that calculating it on all rows would hurt performance significantly.

You could probably do that with a filtered aggregator at ingest time.

It would be great if an example could be given for this; I would much appreciate it.

There’s an example in the Druid docs at the bottom of this page (search for “filtered”): http://druid.io/docs/latest/querying/aggregations. It should work in the metricsSpec of an ingestion job.
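For the wind speed case above, the entry might look something like this (an untested sketch; the output name wind_speed_sketch is made up):

```
{
  "type": "filtered",
  "filter": { "type": "selector", "dimension": "type", "value": "wind speed" },
  "aggregator": {
    "type": "quantilesDoublesSketch",
    "name": "wind_speed_sketch",
    "fieldName": "value"
  }
}
```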

Gian

Hi Gian,

We are evaluating Druid for our company for real-time analytics. As part of that, I am working on POCs of Druid's feature set.

Currently, I am working on finding the mean, median, min, max, and 95th percentile using the DataSketches extension library.

I was able to ingest the data using quantiles sketches, but I am not sure how to query the data that I defined with type quantilesDoublesSketch in the metricsSpec of my ingestion spec. It would be a great help if someone could provide examples.

Thanks,

Sreekanth.G

You should be able to do it with "quantilesDoublesSketch" and "quantilesDoublesSketchToQuantile", described at: http://druid.io/docs/latest/development/extensions-core/datasketches-quantiles.html

Hi Gian,

My metricsSpec in my ingestion spec is below. I was able to ingest, and the qualtile_unit_price column was created in the data source. How do I query that column? The query I tried and the associated error are below. Could you please check and help me out?

```
"metricsSpec": [
  { "type": "count", "name": "views" },
  { "type": "thetaSketch", "name": "visit_sketch", "fieldName": "visit" },
  { "type": "quantilesDoublesSketch", "name": "qualtile_unit_price", "fieldName": "unit_price" },
  { "type": "longSum", "name": "totalquantity", "fieldName": "quantity" },
  { "type": "doubleSum", "name": "total_unit_price", "fieldName": "unit_price" },
  { "type": "doubleSum", "name": "total_discount", "fieldName": "discount" },
  { "type": "doubleSum", "name": "total_taxable_total", "fieldName": "taxable_total" },
  { "type": "doubleSum", "name": "total_shipping_costs", "fieldName": "shipping_costs" }
],
```

Query JSON spec:

```
{
  "queryType": "groupBy",
  "dataSource": "productactivity_quantiles",
  "granularity": "ALL",
  "dimensions": [],
  "postAggregations": [
    {
      "type": "quantilesDoublesSketchToQuantile",
      "name": "unitprice_quantile",
      "fieldName": "qualtile_unit_price"
    }
  ],
  "filter": { "type": "selector", "dimension": "st", "value": "510005" },
  "intervals": [ "2018-04-21T15:28:00.000Z/2018-04-21T19:29:00.000Z" ]
}
```

Error message:

```
{
  "error": "Unknown exception",
  "errorMessage": "Instantiation of [simple type, class io.druid.query.aggregation.datasketches.quantiles.DoublesSketchToQuantilePostAggregator] value failed: field is null (through reference chain: java.util.ArrayList[0])",
  "errorClass": "com.fasterxml.jackson.databind.JsonMappingException",
  "host": null
}
```

I was able to find the appropriate query, which is mentioned below. (The post-aggregator needs a "field" of type "fieldAccess" plus a "fraction", rather than a bare "fieldName", which is why the earlier query failed with "field is null".)

```
{
  "queryType": "timeseries",
  "dataSource": "productactivity_quantiles",
  "granularity": { "type": "period", "period": "PT1M", "origin": "2018-04-21T15:28:00.000Z" },
  "aggregations": [
    {
      "type": "quantilesDoublesSketch",
      "fieldName": "qualtile_unit_price",
      "name": "unitprice_quantile",
      "k": 128
    }
  ],
  "postAggregations": [
    {
      "type": "quantilesDoublesSketchToQuantile",
      "name": "final_unitprice_quantile",
      "field": {
        "type": "fieldAccess",
        "fieldName": "unitprice_quantile"
      },
      "fraction": 0.95
    }
  ],
  "filter": { "type": "selector", "dimension": "st", "value": "510005" },
  "intervals": [ "2018-04-21T15:28:00.000Z/2018-04-21T19:29:00.000Z" ]
}
```

This worked for me.

Thanks,

Sreekanth.G