Downsampling in Druid

Hello!
I’m facing a huge number of data points in my graphs, so I decided to use downsampling, but I couldn’t find anything about it in the Druid docs.
To solve this problem, I need a way to reduce the number of data points I’m getting from my data source (Druid). Unfortunately, doing this in a manner that doesn’t drastically deform the graph is a tricky problem.
Could you help me with this issue? Does Druid have a tool for what I described above?

I forgot to say: I use Grafana for viewing the graphs, with Druid SQL queries.

There’s no need for downsampling. One of Druid’s most powerful features is rollup: it groups rows on your chosen dimensions, applies metrics over those groups, and does so at a granularity of your choice (second, hour, day, etc.). This will significantly reduce the large number of data points you’re describing.
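For a concrete picture, here is a minimal sketch of rollup using SQL-based ingestion (the table and column names, and the hourly granularity, are illustrative, not from your setup):

REPLACE INTO rolled_up_table OVERWRITE ALL
SELECT
  TIME_FLOOR(__time, 'PT1H') as __time,  -- one stored row per hour instead of per event
  name,                                  -- the dimension(s) you group by
  COUNT(*) as "count",                   -- how many raw rows each stored row represents
  SUM(sla) as sum_sla                    -- a metric aggregated over the group
FROM EXTERN ... etc
GROUP BY 1, 2
PARTITIONED BY DAY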

Here’s the doc in the tutorial describing the benefits of rollup:

Thank you @J_B,
How can I compute a quantile with rollup?
Here’s my SQL query example before using rollup:

SELECT
  __time as "time",
  DS_GET_QUANTILE(sla, 0.9) as " ",
  name as "service"
FROM SYTESTER_DRUID
-- WHERE "name" IN (select case when '${service}' = 'Show All' then name else '${service}' end from SYTESTER_DRUID)
-- where name in (${service:sqlstring})
WHERE ('${service}' = 'Show All' OR name in (${service:sqlstring}))
AND __time >= MILLIS_TO_TIMESTAMP(${__from}) AND __time <= MILLIS_TO_TIMESTAMP(${__to})
GROUP BY __time, name

As you can see, I take the quantile from the sla column, but with rollup that column no longer exists. Which metric type should I choose from the options offered in the console, and how will the SQL query change?

I’ve seen the momentSketch type in the docs and tried to create a metric:

But after clicking the “Apply” button, I got an error:

What did I do wrong?

Hi @nik ,

You can do this directly in SQL; you were almost there. The DS_GET_QUANTILE function takes a quantiles sketch as its first parameter, so you first need to aggregate sla into a quantiles sketch, as in this example with the wikipedia test data:

SELECT
  TIME_FLOOR(__time, 'PT1H') as "time",
  DS_GET_QUANTILE(DS_QUANTILES_SKETCH(added), 0.9) as " ",
  page as "service"
FROM "wikiticker-2015-09-12-sampled"
WHERE (page in ('Jeremy Corbyn'))
GROUP BY 1, 3
ORDER BY 1 DESC

In your case it seems to be a Grafana query. You can use:

SELECT
  __time as "time",
  DS_GET_QUANTILE(DS_QUANTILES_SKETCH(sla), 0.9) as "metric",
  name as "service"
FROM SYTESTER_DRUID
-- WHERE "name" IN (select case when '${service}' = 'Show All' then name else '${service}' end from SYTESTER_DRUID)
-- where name in (${service:sqlstring})
WHERE ('${service}' = 'Show All' OR name in (${service:sqlstring}))
AND __time >= MILLIS_TO_TIMESTAMP(${__from}) AND __time <= MILLIS_TO_TIMESTAMP(${__to})
GROUP BY __time, name

Unless you’ve done this at ingestion time, you’ll likely want to transform __time in the query to the minimum time granularity that you want using a TIME_FLOOR or TIME_CEIL function.
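For example, to bucket the Grafana query by minute (the 'PT1M' period here is just an illustration; pick whatever granularity suits your dashboard), the query becomes:

SELECT
  TIME_FLOOR(__time, 'PT1M') as "time",  -- minute buckets instead of raw timestamps
  DS_GET_QUANTILE(DS_QUANTILES_SKETCH(sla), 0.9) as "metric",
  name as "service"
FROM SYTESTER_DRUID
WHERE ('${service}' = 'Show All' OR name in (${service:sqlstring}))
AND __time >= MILLIS_TO_TIMESTAMP(${__from}) AND __time <= MILLIS_TO_TIMESTAMP(${__to})
GROUP BY 1, 3  -- group by the floored time expression, not the raw __time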

You might also want to do this at ingestion time, in that case you could do something like:

REPLACE INTO targetTable OVERWRITE ALL
SELECT
  __time as "time", -- use TIME_FLOOR to aggregate by time bucket at ingestion
  name as "service",
  DS_QUANTILES_SKETCH(sla) as "metricQS"
FROM EXTERN ... etc
GROUP BY 1,2

This will ingest the data and store the sketch without resolving it, allowing for further aggregation at query time across the dimensions you ingest.
You will need to set the query context parameter finalizeAggregations: false in order to make the planner store the internal form of the sketches at ingestion.
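For reference, if you submit the ingestion through the SQL task API rather than the console, the context parameter goes alongside the query in the request body; a sketch, with the query abbreviated:

POST /druid/v2/sql/task
{
  "query": "REPLACE INTO targetTable OVERWRITE ALL SELECT ... GROUP BY 1,2 PARTITIONED BY DAY",
  "context": { "finalizeAggregations": false }
}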

Thanks for the answer @Sergio_Ferragut!
You’re right, I made a mistake in the SQL query, but that’s not what I asked.
There are 3 numeric columns in my data table: rnd (long), status (long), and sla (double), and when I set up rollup, aggregation happens on these 3 columns.
Data before rollup:

Data after rollup:

So I get the aggregated sla column as sum_sla. I guess it wouldn’t be right to use sum_sla in a query like DS_GET_QUANTILE(DS_QUANTILES_SKETCH(sum_sla), 0.9). I also tried creating extra metric columns max_sla and min_sla based on sla, but I don’t think that’s the right way to calculate a quantile either.
Could you please explain how to calculate a quantile with rollup enabled?

Yes @nik, you are correct.
At ingestion time, when doing rollup, you can add a metric using the quantilesDoublesSketch aggregation. You can do this by adding a metric in the Load Data / Configure Schema view when building a spec in the Druid Console:
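In the generated ingestion spec, that metric ends up as a quantilesDoublesSketch entry in the metricsSpec, roughly like this (the output name sla_sketch is illustrative, and k, which trades sketch size for accuracy, is shown at its default):

"metricsSpec": [
  { "type": "count", "name": "count" },
  { "type": "quantilesDoublesSketch", "name": "sla_sketch", "fieldName": "sla", "k": 128 }
]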

The new metric column will be stored as a sketch and it can be used in further aggregation queries such as:
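For example, assuming the sketch metric was named sla_sketch, a query-time aggregation could look like:

SELECT
  name as "service",
  -- DS_QUANTILES_SKETCH merges the stored sketches across the rolled-up rows
  DS_GET_QUANTILE(DS_QUANTILES_SKETCH(sla_sketch), 0.9) as "p90_sla"
FROM SYTESTER_DRUID
GROUP BY name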

Some more docs on the subject.

Let us know how it goes.

@Sergio_Ferragut, thanks for help!
quantilesDoublesSketch works very well for me.

@Sergio_Ferragut, could you explain why there is no mean or avg aggregate function, like max, min, and sum, for rolled-up data?

Sure. Rollup requires storing a state of the calculation that can be further rolled up. If we stored the average during ingestion rollup, we would end up calculating an average of averages when querying that data, which gives the wrong answer whenever the rolled-up groups have different sizes.

Instead, the recommendation is to store SUM(metric) and COUNT at rollup and compute the average at query time as SUM(metricSum) / SUM(metricCount), which yields accurate averages while aggregating over other dimensions.
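A sketch of both halves, with illustrative column and table names:

-- At ingestion time (rollup), store summable pieces per group:
--   SUM(sla) as sum_sla, COUNT(*) as count_sla
-- At query time, recombine the pieces for a correct weighted average:
SELECT
  name as "service",
  SUM(sum_sla) / SUM(count_sla) as "avg_sla"
FROM rolled_up_table
GROUP BY name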

@Sergio_Ferragut, thanks!