Pre-aggregated data in Druid.

Druid supports ingesting both raw and pre-aggregated data (rolled up on dimensions). What are the advantages and disadvantages of providing pre-aggregated data? Also, how does Druid aggregate raw data?

Ingesting pre-aggregated data into Druid will make ingestion "faster", since most of the work is done before ingestion.

By rolling up data, or pre-aggregating it, you reduce storage requirements if your data rolls up well, and you save on the hardware costs of running a Druid cluster. In practice, rolling up data can reduce storage requirements by an average of 40x. The tradeoff is that while you won't lose fidelity in your metrics, you will lose information about the exact time an event occurred (timestamps are truncated to the query granularity).
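
As a minimal illustration (hypothetical events, hour granularity assumed): three raw events that share the same dimension values roll up into a single stored row, with only the truncated timestamp retained:

    Raw events:
      2016-01-01T01:02:33Z  page=index  count=1
      2016-01-01T01:15:04Z  page=index  count=1
      2016-01-01T01:48:19Z  page=index  count=1

    Rolled-up row (timestamp truncated to the hour):
      2016-01-01T01:00:00Z  page=index  count=3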


Answers inline below.

Thanks for the quick response, Fangjin and Slim. I am adding raw data to Druid using Spark Streaming and Tranquility. Now I have two more questions.

1. My real-time streaming runs at an interval of 2 minutes, but I need a time granularity of one hour in Druid. Since my streaming interval is 2 minutes, I can pre-aggregate data for 2 minutes only, and I have to ingest data immediately because I also need to support up-to-now queries. I could run re-indexing to get pre-aggregated data for past hours; is there any other way to achieve the same?

Druid does rollup/pre-aggregation for you at ingestion time. Set queryGranularity in your ingestion spec to configure this; you don't need to do it in Spark.
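
For example, a minimal sketch of the granularitySpec inside a dataSchema as used with Tranquility (the datasource name "my_datasource", the segment granularity, and the elided fields are placeholder assumptions, not taken from your setup):

    "dataSchema": {
      "dataSource": "my_datasource",
      ...
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "hour",
        "queryGranularity": "hour"
      }
    }

With queryGranularity set to "hour", event timestamps are truncated to the hour on arrival, so rows landing in the same hour roll up incrementally even though your stream delivers data every 2 minutes.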

2. When querying Druid using plyql, I get rows with already-aggregated data. For example, if I add 100 events with the same dimensions and only count as my metric, querying "select * from datasource" returns 1 row with a count of 100. So is Druid itself aggregating data during ingestion?

Yes, that is rollup happening at ingestion time. See http://druid.io/docs/0.9.1.1/design/index.html
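
For reference, the behavior you're seeing comes from a count aggregator in the metricsSpec; a minimal sketch (the metric name "count" is a common convention here, and the fragment is abbreviated, not your exact spec):

    "metricsSpec": [
      { "type": "count", "name": "count" }
    ]

With queryGranularity set, the 100 events that share the same dimension values within one granularity bucket are stored as a single row whose count metric is 100, which is exactly what plyql then returns.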