Does Druid support pre-aggregation of data as part of ingestion?

Hi,

Ad-hoc queries are great, but we also need to serve predefined queries in milliseconds, and for those, pre-aggregations are king (e.g. group-bys on certain dimension combinations computed in advance).

Does Druid also support predefined queries by means of pre-aggregation during ingestion? (i.e. realtime pre-aggregation on the realtime nodes and batch pre-aggregation during a batch import).

I am thinking of a workflow where we would use an HTTP endpoint to which we send requests such as "create a new pre-aggregate or index with this JSON definition" or "delete a pre-aggregate".

If this is possible, is it also possible to create some pre-aggregations from the input stream and then dump the granular events (the input stream itself)?

This would allow us to skip Spark Streaming and just use one system, one set of metric definitions/implementations, and so on.

Please advise,

Nicu

Hi, please see inline.

Hi,

Ad-hoc queries are great, but we also need to serve predefined queries in milliseconds, and for those, pre-aggregations are king (e.g. group-bys on certain dimension combinations computed in advance).

Does Druid also support predefined queries by means of pre-aggregation during ingestion? (i.e. realtime pre-aggregation on the realtime nodes and batch pre-aggregation during a batch import).

Yes, this is configured in the ingestion spec used at ingestion time. Ingested data can be rolled up to predefined granularities (minute, hour, day, etc.) or to a custom granularity.
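For example (a minimal sketch; the interval value is just illustrative and only applies to batch ingestion), the rollup granularity is controlled by the granularitySpec inside the dataSchema of the ingestion spec:

  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "DAY",
    "queryGranularity": "HOUR",
    "intervals": ["2015-01-01/2015-02-01"]
  }

Here queryGranularity controls how finely timestamps are truncated before rows are rolled up, while segmentGranularity only controls how the resulting segments are partitioned by time.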

I am thinking of a workflow where we would use an HTTP endpoint to which we send requests such as "create a new pre-aggregate or index with this JSON definition" or "delete a pre-aggregate".

If this is possible, is it also possible to create some pre-aggregations from the input stream and then dump the granular events (the input stream itself)?

This is also possible. Oftentimes people will pre-aggregate data before loading it into Druid.

Hi,
You mentioned that ingested data can be rolled up on the basis of some custom granularity. Can you give an example spec for it?

In my system, people generally do topN queries on some dimension. It would be great if this could be pre-aggregated for at least some dimensions (5-10 of them; there are around 80 dimensions in total).

Hey Saksham,

All of the ingestion specs (batch, realtime) can be given a "queryGranularity" to do rollup at ingestion time. By default this is "NONE", but you can also set it to "MINUTE", "HOUR", or any query granularity object supported by Druid.
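As a rough sketch (the datasource, dimension, and metric names below are made up, and the exact fields should be checked against the docs for your Druid version), the relevant part of a dataSchema could look like:

  "dataSchema": {
    "dataSource": "events_rollup",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": { "column": "timestamp", "format": "auto" },
        "dimensionsSpec": { "dimensions": ["country", "device", "campaign"] }
      }
    },
    "metricsSpec": [
      { "type": "count", "name": "count" },
      { "type": "longSum", "name": "clicks", "fieldName": "clicks" }
    ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "DAY",
      "queryGranularity": "HOUR"
    }
  }

With queryGranularity set to HOUR, all rows sharing the same truncated hourly timestamp and the same combination of the listed dimensions are collapsed into a single row, with the metrics aggregated. Rollup works per dimension combination, so the fewer dimensions you keep in dimensionsSpec, the more the data collapses, and topN queries on the remaining dimensions then scan the pre-aggregated rows instead of the raw events.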

Hi,

How can I index different pre-aggregations of the same data? In the same datasource or a different one?

Let's say I import data into the datasource "hourly_dimensions_metrics", which is a primary aggregation of the actual events, and from which all the other aggregations are derived.

(My understanding is that I can run a Hadoop index task that scans the pre-joined event logs directly and outputs "hourly_dimensions_metrics", or have Spark/Hadoop MapReduce pre-compute it.)

Now I want to precompute a few aggregations on top of "hourly_dimensions_metrics", such as a daily rollup, or one with fewer dimensions.

How do I do this? With more index tasks, having “hourly_dimensions_metrics” as both source and destination?

If I need to use different datasources (each with a fixed schema, which is what I intuitively expect), then the queries will no longer be agnostic of the indexing, meaning each query will have to name the datasource (the index name). Is this correct?

Thanks,

Nicu

Inline.

Hi,

How can I index different pre-aggregations of the same data? In the same datasource or a different one?

You can reindex data for different time intervals.
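For example, the Hadoop indexer can read its input from an existing Druid datasource rather than from raw files, so you can roll "hourly_dimensions_metrics" up further into another datasource. A rough sketch of the ioConfig (the datasource name and interval are placeholders; check the reindexing docs for your version):

  "ioConfig": {
    "type": "hadoop",
    "inputSpec": {
      "type": "dataSource",
      "ingestionSpec": {
        "dataSource": "hourly_dimensions_metrics",
        "intervals": ["2015-01-01/2015-02-01"]
      }
    }
  }

The accompanying dataSchema then names the output datasource (e.g. "daily_dimensions_metrics"), lists the reduced set of dimensions, and uses a coarser queryGranularity such as DAY. If you keep the same datasource name as both source and destination, the newly built segments replace the old ones for that interval, as described below.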

Let's say I import data into the datasource "hourly_dimensions_metrics", which is a primary aggregation of the actual events, and from which all the other aggregations are derived.

(My understanding is that I can run a Hadoop index task that scans the pre-joined event logs directly and outputs "hourly_dimensions_metrics", or have Spark/Hadoop MapReduce pre-compute it.)

Now I want to precompute a few aggregations on top of "hourly_dimensions_metrics", such as a daily rollup, or one with fewer dimensions.

How do I do this? With more index tasks, having “hourly_dimensions_metrics” as both source and destination?

Druid uses MVCC and every segment has a version. When you create new segments with modified data for the same interval, they replace old segments that were there before.

If Druid is already aggregating the data at these granularities, what are the advantages of providing pre-aggregated data?