Ingestion: Rollup and Cardinality

Rollup is a feature that can dramatically reduce the size of data stored in Druid by summarizing raw data during the ingestion phase. It can potentially reduce the number of rows stored, and therefore cluster cost, by a factor of 10x or more. This post provides some useful SQL - run against your existing data and the underlying Apache Druid metadata tables - that helps you estimate the benefit of rollup for your dataset without needing to re-ingest it.

Roll-up is a GROUP BY-type action that takes place when data is ingested. And just like a GROUP BY, you may choose to emit [aggregations or data sketches](https://druid.apache.org/docs/latest/ingestion/index.html#metricsspec) in the metricsSpec section of your ingestion specification.

The only variable in the efficiency of roll-up - i.e. its ability to turn 10 rows into 1 row - is the same as it is for any GROUP BY operation: the combined cardinality of the columns you group on. In Druid’s case, those are the source data columns that you list in the dimensionsSpec of your ingestion specification.

A discrete piece of functionality in Druid is to automatically truncate incoming timestamps. That’s done by specifying a queryGranularity in the ingestion spec. Here’s an example data set where queryGranularity is FIVE_MINUTE. Every event, instead of being ingested with its original timestamp, has had its timestamp truncated to the start of its five-minute bucket.

Time Name Dance
09:00 Peter Fandango
09:00 John Fandango
09:00 Peter Fandango
09:05 John Fandango
09:05 John Fandango
09:05 John Fandango
09:05 Peter Waltz
09:05 Peter Waltz

Now let’s turn on roll-up. And imagine that we’ve added a count metric (remember, you don’t have to emit metrics) because - well, we can! Our queryGranularity-truncated events roll up well: there are only a few distinct combinations of dimension values within each time bucket.

Time Name Dance Count
09:00 Peter Fandango 2
09:00 John Fandango 1
09:05 John Fandango 3
09:05 Peter Waltz 2
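
To make the GROUP BY analogy concrete, this is roughly the query-time equivalent of what roll-up does at ingestion time. It is only an illustrative sketch - roll-up itself is configured in the ingestion spec, and the "dances" datasource with Name and Dance columns is hypothetical:

SELECT TIME_FLOOR("__time", 'PT5M') AS "__time",   -- truncate each timestamp to its five-minute bucket
   "Name",
   "Dance",
   COUNT(*) AS "Count"                             -- the emitted count metric
FROM "dances"                                      -- hypothetical raw datasource
GROUP BY 1, 2, 3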

Knowing the cardinality of your incoming data is essential to improving your roll-up ratio. Think about this example:

Time Name Dance
09:00 Peter Fandango
09:00 Mary Fandango
09:00 Tom Fandango
09:05 Brian Fandango
09:05 Peter Fandango
09:05 Mary Fandango
09:05 Tom Fandango
09:05 Terry Fandango

Notice that, within the 5-minute buckets (our queryGranularity-truncated timestamp), every single event relates to a different dancer. It doesn’t even matter that each person always dances the same dance (they must really love the fandango) or that a given dancer dances the same dance later on. Our 8 incoming rows get stored as 8 rows because the GROUP BY is across all of the dimensions.

Time Name Dance Count
09:00 Peter Fandango 1
09:00 Mary Fandango 1
09:00 Tom Fandango 1
09:05 Brian Fandango 1
09:05 Peter Fandango 1
09:05 Mary Fandango 1
09:05 Tom Fandango 1
09:05 Terry Fandango 1

And there’s a second scenario: lots of combinations of values.

Time Name Dance
09:00 Peter Fandango
09:00 Mary Polka
09:00 Mary Vogue
09:05 Brian Fandango
09:05 Lucy Waltz
09:05 Claire Fandango
09:05 Sian Waltz
09:05 Terry Waltz

Here there are just too many combinations of values within each period of time. In each five-minute interval, every row is a different combination of dancer and dance. The roll-up just doesn’t work - it is, after all, a GROUP BY.

Time Name Dance Count
09:00 Peter Fandango 1
09:00 Mary Polka 1
09:00 Mary Vogue 1
09:05 Brian Fandango 1
09:05 Lucy Waltz 1
09:05 Claire Fandango 1
09:05 Sian Waltz 1
09:05 Terry Waltz 1

One cause for this combined cardinality problem could be hierarchy. Let’s imagine that Peter is a specialist in the Fandango and Voguing. (Well done, Peter). John, meanwhile, is king of the Foxtrot, Waltz, and Paso Doble.

Time Teacher Dance
09:00 Peter Fandango
09:00 John Foxtrot
09:00 Peter Vogue
09:05 Peter Fandango
09:05 John Foxtrot
09:05 John Waltz
09:05 Peter Fandango
09:05 John Paso Doble

The roll-up ends up looking like this:

Time Teacher Dance Count
09:00 Peter Fandango 1
09:00 John Foxtrot 1
09:00 Peter Vogue 1
09:05 Peter Fandango 2
09:05 John Foxtrot 1
09:05 John Waltz 1
09:05 John Paso Doble 1

Here, the roll-up is less effective because each teacher (the root) knows a distinct set of dances (the leaves), and it’s very unlikely that they’d repeat the same dance in the same roll-up period.

You can look at data you have ingested already to get a feel for its profile.

You can find the number of rows in a one-hour period simply by running this in the Druid console:

SELECT COUNT(*) AS rowCount
FROM "your-dataset"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR

Finding cardinality is very easy as well:

SELECT COUNT(DISTINCT "your-column") AS columnCardinality
FROM "your-dataset"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR

Of course, you can check more than one column at once - just be cautious, as on large datasets this can swamp your cluster…

SELECT COUNT(DISTINCT "your-column-1") AS column1Cardinality,
   COUNT(DISTINCT "your-column-2") AS column2Cardinality,
   :
   :
FROM "your-dataset"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR

Even more useful is the ratio of unique values to rows for a column. If you have the wikipedia edits sample data loaded, try this query, which gives you the ratio for just a few of the columns in the data set. (Notice that the __time WHERE clause is absent, so it scans the whole datasource.)

SELECT CAST(COUNT (DISTINCT cityName) AS FLOAT) / COUNT(*) AS cityNameRowratio,
   CAST(COUNT (DISTINCT channel) AS FLOAT) / COUNT(*) AS channelRowratio,
   CAST(COUNT (DISTINCT comment) AS FLOAT) / COUNT(*) AS commentRowratio,
   CAST(COUNT (DISTINCT countryIsoCode) AS FLOAT) / COUNT(*) AS countryIsoCodeRowratio,
   CAST(COUNT (DISTINCT countryName) AS FLOAT) / COUNT(*) AS countryNameRowratio,
   CAST(COUNT (DISTINCT diffUrl) AS FLOAT) / COUNT(*) AS diffUrlRowratio,
   CAST(COUNT (DISTINCT flags) AS FLOAT) / COUNT(*) AS flagsRowratio,
   CAST(COUNT (DISTINCT isAnonymous) AS FLOAT) / COUNT(*) AS isAnonymousRowratio,
   CAST(COUNT (DISTINCT isMinor) AS FLOAT) / COUNT(*) AS isMinorRowratio,
   CAST(COUNT (DISTINCT isNew) AS FLOAT) / COUNT(*) AS isNewRowratio,
   CAST(COUNT (DISTINCT isRobot) AS FLOAT) / COUNT(*) AS isRobotRowratio
FROM "wikipedia"

Columns with ratios approaching 1 are the main cause of your low roll-up. In the wikipedia dataset, it’s clearly the diffUrl column. At the other end of the scale, columns with ratios near 0 - like the wikipedia sample data columns that start with is - are the ones whose queries suffer because of the poor roll-up.

The next step - working out whether your data has high compound cardinality - is trickier. You can adapt the query above to assess combinations of dimensions by concatenating them together: say, the dancer and the dance. Here, "your-column-1" and "your-column-2" are placeholders for the columns you want to assess.

SELECT TIME_FLOOR("__time", 'PT5M') AS timeBucket,
    COUNT(*) AS rowCount,
    COUNT(DISTINCT "your-column-1" || "your-column-2") AS comboCardinality
FROM "your-dataset"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY 1
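
You can also estimate the overall roll-up ratio for a candidate set of dimensions directly, by comparing the raw row count with the number of rows a GROUP BY over those dimensions plus the truncated timestamp would produce. Here is a sketch against the wikipedia sample data, assuming - purely for illustration - that you kept only channel and countryName as dimensions with an HOUR queryGranularity; adjust the columns and period to match your own candidate spec:

SELECT SUM("cnt") AS rawRows,                       -- rows you have today
   COUNT(*) AS rolledUpRows,                        -- rows you would have after roll-up
   SUM("cnt") * 1.0 / COUNT(*) AS rollupRatio       -- e.g. 10 means a 10:1 reduction
FROM (
   -- the inner query simulates roll-up for the candidate dimensions
   SELECT TIME_FLOOR("__time", 'PT1H') AS "hourBucket", "channel", "countryName", COUNT(*) AS "cnt"
   FROM "wikipedia"
   GROUP BY 1, 2, 3
) AS t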

So what should you do?

Your options boil down to cardinality shrinking or dimension shrinking.

On the dimension-shrinking front, a common pattern is to store two copies of the data in Druid: one with all fields, and one with a commonly-used subset of fields designed to take advantage of rollup. Then, you can query the appropriate dataset for a particular use case.

On the cardinality-shrinking front, you can replace high-cardinality columns with a data sketch. Remember this?

Time Name Dance
09:00 Peter Fandango
09:00 Mary Fandango
09:00 Tom Fandango
09:05 Brian Fandango
09:05 Peter Fandango
09:05 Mary Fandango
09:05 Tom Fandango
09:05 Terry Fandango

If all we need is an approximate distinct count, or to do a set operation - say, for funnel analysis - we could leave the Name column out of the dimensions and emit a data sketch of it instead.

Time Dance Name-HLLSketch
09:00 Fandango fhvwwofihqwkjwd
09:05 Fandango fhvwwofihqwkjqf

In fact, you could go further and increase your queryGranularity to HOUR and end up with just one row instead of hundreds - millions, even. This is particularly important for advertising and network data on columns like auction-identifier or ip-address, where you don’t need to filter or sort on the individual values - only to aggregate them approximately.
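
At query time you would then use the sketch column with an approximate aggregator rather than a plain COUNT(DISTINCT). Here is a sketch of what that might look like, assuming the druid-datasketches extension is loaded, and using a hypothetical rolled-up datasource called "dances-rollup" with an HLL sketch metric called "name_sketch":

SELECT TIME_FLOOR("__time", 'PT1H') AS hourBucket,
   "Dance",
   APPROX_COUNT_DISTINCT_DS_HLL("name_sketch") AS approxDancers   -- approximate distinct dancers per hour
FROM "dances-rollup"                                              -- hypothetical rolled-up datasource
GROUP BY 1, 2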

Or perhaps, again in cardinality-shrinking land, it would be worth checking whether particular filters are always applied in your queries. You could either apply that filter in the transformSpec to remove some cardinality at ingestion time, or you could create one dataset with all values and another with highly-focused filters applied.
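
To gauge how much difference such a filter would make, compare row counts and cardinalities with and without it. For example, on the wikipedia sample data - assuming, purely for illustration, that your queries always exclude robots and that isRobot is stored as a string dimension with 'true'/'false' values, as it is in the sample:

SELECT COUNT(*) AS allRows,
   COUNT(*) FILTER (WHERE "isRobot" = 'false') AS filteredRows,            -- rows left after the filter
   COUNT(DISTINCT "user") AS allUsers,
   COUNT(DISTINCT "user") FILTER (WHERE "isRobot" = 'false') AS filteredUsers   -- cardinality left after the filter
FROM "wikipedia"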

If you fancy scripting this kind of work, you can start with this query to get a list of the columns in your datasource. You could paste the output into a spreadsheet to build a single query covering all of the columns, so you don’t have to write each one out manually - or perhaps you fancy writing a Druid data profiler…!

SELECT "COLUMN_NAME"
FROM INFORMATION_SCHEMA.COLUMNS
WHERE "TABLE_NAME" = 'your-dataset'

When rolled-up data is used, there are substantial query performance and cost benefits - a little research like this up front, on the volumes of data Druid often handles, can make a disproportionately humongous difference!

This is a community article! If you have suggestions and thoughts, Reply Now and let’s help make this resource better.

Tested on Druid 0.19

@Matt_Sarrel and @Rachel_Pedreschi I’d welcome your comments on this post.