Question about the Rollup feature

Hi,

I have been researching Druid, and while going through the docs I came across the following description of a feature:

“Rollup: The aggregation of data that occurs at one or more stages, based on settings in a configuration.”

What exactly does “one or more stages” mean here, and what are those stages?

Will Druid roll up data even after segments are pushed to deep storage [to reduce the amount of data being stored by merging segments]?

How do we specify configurations for rollups? I don’t see it anywhere in the spec file. Is there a complete example spec file, with all possible options and values?

Regards,

Tamil.s

Druid can roll up/aggregate data both at ingestion time and at query time. Typically, you would roll up data at ingestion time as much as possible, so that there is less work to do while querying and query responses are fast.

Roll-up basically means that all records with the same timestamp (the granularity of the timestamp can be configured to be millisecond, hour, day, etc.) and the same combination of dimension values will be aggregated into a single row inside the segment.
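To illustrate with made-up data: suppose queryGranularity is set to “hour”, the dimensions are “page” and “user”, and clicks are summed. Three raw events like these:

    {"timestamp": "2015-09-12T10:02:11Z", "page": "Home", "user": "alice", "clicks": 1}
    {"timestamp": "2015-09-12T10:15:49Z", "page": "Home", "user": "alice", "clicks": 1}
    {"timestamp": "2015-09-12T10:58:03Z", "page": "Home", "user": "alice", "clicks": 1}

would be stored as a single row, with the timestamp truncated to the hour and the metric aggregated:

    {"timestamp": "2015-09-12T10:00:00Z", "page": "Home", "user": "alice", "clicks": 3}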

At ingestion time, it is configured via “queryGranularity” inside the granularity spec (http://druid.io/docs/latest/Ingestion.html#granularityspec). You can see where it appears in the real-time and batch ingestion specs (http://druid.io/docs/latest/Concepts-and-Terminology.html#specfile).
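For reference, a minimal sketch of a granularitySpec and the accompanying metricsSpec; the interval, granularities, and field names here are just placeholders:

    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "DAY",
      "queryGranularity": "HOUR",
      "intervals": ["2015-09-12/2015-09-13"]
    },
    "metricsSpec": [
      {"type": "count", "name": "count"},
      {"type": "longSum", "name": "clicks", "fieldName": "clicks"}
    ]

With a spec like this, rows within the same hour that have identical dimension values are combined at ingestion time, and each segment covers one day.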

At query time, it is configured via the “granularity” field of the query. You can read more about it at http://druid.io/docs/latest/Querying.html
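For example, a timeseries query that aggregates hourly ingested data up to daily buckets at query time might look like this (the datasource and metric names are hypothetical):

    {
      "queryType": "timeseries",
      "dataSource": "my_datasource",
      "granularity": "day",
      "aggregations": [
        {"type": "longSum", "name": "clicks", "fieldName": "clicks"}
      ],
      "intervals": ["2015-09-01/2015-09-13"]
    }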

– Himanshu

You can also see our blog post http://druid.io/blog/2011/04/30/introducing-druid.html to learn more about rollup.

Just wondering how rollups would pan out if we want to do them on a daily or even monthly basis for historical data. Storing hour-level aggregates is good for very recent data, but for data as old as, say, six months to a year, monthly rollups would be faster, I guess.

Do you suggest rolling up into another datasource, say datasource_month?

You don’t really need to create a new datasource. You would re-index the old data with a coarser granularity, and that will create new segments. The new segments will shadow the old segments for the same interval, so queries will run against the newly created segments.
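For example, re-indexing a month of old data into monthly rollups would just mean running a batch ingestion job over that interval with a coarser granularitySpec, roughly like this (the interval is a placeholder):

    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "MONTH",
      "queryGranularity": "MONTH",
      "intervals": ["2015-03-01/2015-04-01"]
    }

The resulting monthly segments then shadow the old hourly ones for that interval.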

Another way of re-indexing, which does not require you to keep the old raw data around, is to use the IngestSegmentFirehose (one downside is that it can’t run on a Hadoop cluster, if that is what you use for most of your indexing). See http://druid.io/docs/latest/Ingestion-FAQ.html
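As a rough sketch, an IndexTask using the IngestSegmentFirehose to re-index an old interval at monthly granularity could look like the following; the datasource, interval, and field names are placeholders, and the exact spec layout varies between Druid versions:

    {
      "type": "index",
      "spec": {
        "dataSchema": {
          "dataSource": "my_datasource",
          "metricsSpec": [
            {"type": "longSum", "name": "clicks", "fieldName": "clicks"}
          ],
          "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "MONTH",
            "queryGranularity": "MONTH",
            "intervals": ["2015-03-01/2015-04-01"]
          }
        },
        "ioConfig": {
          "type": "index",
          "firehose": {
            "type": "ingestSegment",
            "dataSource": "my_datasource",
            "interval": "2015-03-01/2015-04-01"
          }
        }
      }
    }

(The usual parser section of the dataSchema is omitted here for brevity.) The firehose reads rows out of the existing segments for the given interval instead of from an external source, which is why the raw input data does not need to be kept around.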

– Himanshu

“it can’t run on hadoop cluster”
Does this mean Druid can’t take already-pushed segments from HDFS deep storage and re-index them via this method?

It means you will have to use the IndexTask instead of the HadoopIndexTask to do the re-indexing with the IngestSegmentFirehose. The limitation of the IndexTask is that it is not scalable for very large amounts of data.