Looking at the effect of cardinality on rollup

I thought I’d share some lessons I learned about roll-up and how cardinality affects it – just a little process I went through that may be useful to people.

I think of roll-up as a GROUP BY * that takes place when data is ingested. And, just like a GROUP BY, you can choose to emit aggregations or data sketches – you list those in the metricsSpec section of your ingestion specification.

The only variable in the efficiency of roll-up - i.e. its ability to turn 10 rows into 1 row - is the same as it is for any GROUP BY operation: the cardinality of the dimensions in your SELECT. In Druid’s case, those are the source data columns that you list in the ‘dimensionsSpec’ of your ingestion specification.

A separate piece of functionality in Druid is the automatic truncation of incoming timestamps. That’s done by specifying a queryGranularity in the ingestion spec. Here’s an example data set where queryGranularity is set to FIVE_MINUTE: every event, instead of being ingested with its original timestamp, has had that timestamp truncated to the start of its five-minute bucket.

Time Name Dance
09:00 Peter Fandango
09:00 John Fandango
09:00 Peter Fandango
09:05 John Fandango
09:05 John Fandango
09:05 John Fandango
09:05 Peter Waltz
09:05 Peter Waltz

Now let’s turn on roll-up, and imagine that we’ve added a count metric (remember, you don’t have to emit metrics) because - well, we can! Our queryGranularity-truncated events roll up well: each column has low cardinality within each time bucket.

Time Name Dance Count
09:00 Peter Fandango 2
09:00 John Fandango 1
09:05 John Fandango 3
09:05 Peter Waltz 2
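
If it helps to see that GROUP BY in action, here’s a minimal sketch in Druid SQL that would produce the rolled-up table above from the raw events - assuming a hypothetical raw table called "dance_events" with Name and Dance as dimensions and the event time in __time. (In Druid itself this grouping happens at ingestion time, not query time.)

-- Illustrative only: roll-up behaves like a GROUP BY across the truncated
-- timestamp and every dimension in the dimensionsSpec
SELECT TIME_FLOOR("__time", 'PT5M') AS "Time",
   "Name",
   "Dance",
   COUNT(*) AS "Count"
FROM "dance_events"
GROUP BY 1, 2, 3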

Knowing the cardinality of your incoming data is essential to improving your roll-up ratio. Think about this example:

Time Name Dance
09:00 Peter Fandango
09:00 Mary Fandango
09:00 Tom Fandango
09:05 Brian Fandango
09:05 Peter Fandango
09:05 Mary Fandango
09:05 Tom Fandango
09:05 Terry Fandango

Notice that, within the five-minute buckets (our queryGranularity-truncated timestamp), every single event relates to a different dancer. It doesn’t even matter that each person always dances the same dance (they must really love the fandango), or that a given dancer dances it again later on. Our 8 incoming rows get stored as 8 rows because the GROUP BY is across all of the dimensions.

Time Name Dance Count
09:00 Peter Fandango 1
09:00 Mary Fandango 1
09:00 Tom Fandango 1
09:05 Brian Fandango 1
09:05 Peter Fandango 1
09:05 Mary Fandango 1
09:05 Tom Fandango 1
09:05 Terry Fandango 1
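
If you suspect a single dimension (like Name here), a sketch along these lines - again using the hypothetical "dance_events" table - shows how close that one column comes to being unique within each five-minute bucket:

-- Illustrative only: compare rows to distinct values per time bucket
SELECT TIME_FLOOR("__time", 'PT5M') AS timeBucket,
   COUNT(*) AS rowCount,
   COUNT (DISTINCT "Name") AS distinctNames
FROM "dance_events"
GROUP BY 1

When distinctNames tracks rowCount bucket by bucket, that one dimension is enough to defeat roll-up on its own.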

And there’s a second scenario: lots of combinations of values.

Time Name Dance
09:00 Peter Fandango
09:00 Mary Polka
09:00 Mary Vogue
09:05 Brian Fandango
09:05 Lucy Waltz
09:05 Claire Fandango
09:05 Sian Waltz
09:05 Terry Waltz

Here there are just too many combinations of values within each time period. In each five-minute interval, every row is a different dancer-and-dance combination. The roll-up just doesn’t work - it is, after all, a GROUP BY.

Time Name Dance Count
09:00 Peter Fandango 1
09:00 Mary Polka 1
09:00 Mary Vogue 1
09:05 Brian Fandango 1
09:05 Lucy Waltz 1
09:05 Claire Fandango 1
09:05 Sian Waltz 1
09:05 Terry Waltz 1

One cause for this combined cardinality problem could be hierarchy. Let’s imagine that Peter is a specialist in the Fandango and Voguing. (Well done, Peter). John, meanwhile, is king of the Foxtrot, Waltz, and Paso Doble.

Time Teacher Dance
09:00 Peter Fandango
09:00 John Foxtrot
09:00 Peter Vogue
09:05 Peter Fandango
09:05 John Foxtrot
09:05 John Waltz
09:05 Peter Fandango
09:05 John Paso Doble

The roll-up ends up looking like this:

Time Teacher Dance Count
09:00 Peter Fandango 1
09:00 John Foxtrot 1
09:00 Peter Vogue 1
09:05 Peter Fandango 2
09:05 John Foxtrot 1
09:05 John Waltz 1
09:05 John Paso Doble 1

Here, the roll-up is less effective because each teacher (the root) knows a distinct set of dances (the leaves), and rows only combine when the same teacher happens to repeat the same dance within the roll-up period.
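
One way to spot this kind of hierarchy in your own data is to count the leaves per root. Here’s a sketch, assuming a hypothetical "dance_classes" table shaped like the example above:

-- Illustrative only: how many distinct leaves does each root have?
SELECT "Teacher",
   COUNT (DISTINCT "Dance") AS distinctDances,
   COUNT(*) AS rowCount
FROM "dance_classes"
GROUP BY "Teacher"

When distinctDances sits close to rowCount for most roots, each row is a different leaf for that root, and there’s little for roll-up to combine.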

You can look at data you have ingested already to get a feel for its profile.

You can find the number of rows in a one hour period simply by using the Druid console:

SELECT COUNT(*) AS rowCount
FROM "your-dataset"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR

Finding cardinality is very easy as well:

SELECT COUNT (DISTINCT "your-column") AS columnCardinality
FROM "your-dataset"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR

Of course, you can do more than one at once, but just be cautious - on large datasets this can swamp your cluster…

SELECT COUNT (DISTINCT "your-column-1") AS column1Cardinality,
   COUNT (DISTINCT "your-column-2") AS column2Cardinality,
   :
   :
FROM "your-dataset"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR

Even more useful is the ratio of unique values to rows for a column. If you have the wikipedia edits sample data loaded, try this query, which gives you the ratio for just a few of the columns in the data set. (Notice that this time there’s no WHERE clause on __time.)

SELECT CAST(COUNT (DISTINCT cityName) AS FLOAT) / COUNT(*) AS cityNameRowratio,
   CAST(COUNT (DISTINCT channel) AS FLOAT) / COUNT(*) AS channelRowratio,
   CAST(COUNT (DISTINCT comment) AS FLOAT) / COUNT(*) AS commentRowratio,
   CAST(COUNT (DISTINCT countryIsoCode) AS FLOAT) / COUNT(*) AS countryIsoCodeRowratio,
   CAST(COUNT (DISTINCT countryName) AS FLOAT) / COUNT(*) AS countryNameRowratio,
   CAST(COUNT (DISTINCT diffUrl) AS FLOAT) / COUNT(*) AS diffUrlRowratio,
   CAST(COUNT (DISTINCT flags) AS FLOAT) / COUNT(*) AS flagsRowratio,
   CAST(COUNT (DISTINCT isAnonymous) AS FLOAT) / COUNT(*) AS isAnonymousRowratio,
   CAST(COUNT (DISTINCT isMinor) AS FLOAT) / COUNT(*) AS isMinorRowratio,
   CAST(COUNT (DISTINCT isNew) AS FLOAT) / COUNT(*) AS isNewRowratio,
   CAST(COUNT (DISTINCT isRobot) AS FLOAT) / COUNT(*) AS isRobotRowratio
FROM "wikipedia"

Columns whose ratio approaches 1 are the main cause of your low roll-up. In the wikipedia dataset, it’s clearly the diffUrl column. At the other end of the scale, the low-ratio columns - like the wikipedia sample data columns that start with is - point to the queries that are suffering because of the poor roll-up, since on their own they would have rolled up nicely.
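
You can also estimate up front how well a candidate set of dimensions would roll up at a candidate queryGranularity by counting the distinct combinations. This is just a sketch - it assumes the wikipedia sample data and arbitrarily picks channel, isRobot, and isNew as the dimensions with HOUR as the granularity - so substitute your own columns and bucket size:

-- Illustrative only: how many rows would survive roll-up with these dimensions at HOUR?
SELECT COUNT(*) AS rawRows,
   COUNT (DISTINCT TIME_FORMAT("__time", 'yyyy-MM-dd HH')
      || '|' || "channel" || '|' || "isRobot" || '|' || "isNew") AS rolledUpRows
FROM "wikipedia"

Bear in mind that COUNT DISTINCT is approximate by default in Druid SQL, so treat rolledUpRows as an estimate; the closer it is to rawRows, the less roll-up will help.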

The next step - working out whether the data has high compound cardinality - is trickier. I took the same COUNT DISTINCT approach and applied it to combinations of dimensions, assessing, say, the dancer and the dance together within each time bucket:

SELECT TIME_FLOOR("__time", 'PT5M') AS timeBucket,
    COUNT(*) AS rowCount,
    COUNT (DISTINCT "your-column-1" || '|' || "your-column-2") AS combinedCardinality
FROM "datapipePMRawEvents"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY 1

What I decided to do in my case was one of two things. The first was to shrink the number of dimensions - so maybe have two tables, one with all the fields and one with a commonly-used subset of fields designed to take advantage of roll-up - and then query the appropriate dataset for a particular use case. The other was to attack the cardinality problem itself: maybe with a datasketch (if the use case is COUNT DISTINCT / set operations / quantiles), though I’ve also heard of people using clever functions, from simple truncation to if-then-else logic that creates a new, lower-cardinality dimension.
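
As a sketch of the truncation idea - using the wikipedia sample data and an arbitrary truncation length, where the real thing would be an ingestion-time transform - you can preview how much cardinality a derived dimension would shed before you build it:

-- Illustrative only: cardinality of a column before and after truncating it
SELECT COUNT (DISTINCT "diffUrl") AS fullCardinality,
   COUNT (DISTINCT SUBSTRING("diffUrl", 1, 40)) AS truncatedCardinality
FROM "wikipedia"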

In my case I could then increase the queryGranularity to HOUR, and I ended up with just one row instead of hundreds. That was particularly important because I was working on a clickstream project – and that generates tonnes of data!

There was another option as well: creating separate tables with some common filters already applied. That reduced the row count and had a big impact on cardinality too.
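
To size that option, a quick sketch (on the wikipedia sample again, with a filter chosen purely for illustration) compares the unfiltered row count with what a pre-filtered table would hold:

-- Illustrative only: how many rows would a pre-filtered datasource contain?
SELECT COUNT(*) AS allRows,
   COUNT(*) FILTER (WHERE "isRobot" = 'false') AS humanEditRows
FROM "wikipedia"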

Hope that’s helpful to share!

Tested on Druid 0.19


@Matt_Sarrel and @Rachel_Pedreschi I’d welcome your comments on this post.