Rollup is a feature that can dramatically reduce the size of data stored in Druid by summarizing raw data during the ingestion phase. It can potentially reduce the number of rows stored, and therefore cluster cost, by a factor of 10x or more. This post provides some useful SQL, run against data you already have in Apache Druid, that helps you estimate the benefit of rollup for your dataset without needing to re-ingest it.
Roll-up is a GROUP BY-type action that takes place when data is ingested. And just like a GROUP BY, you may choose to emit [aggregations or data sketches](https://druid.apache.org/docs/latest/ingestion/index.html#metricsspec) in the metricsSpec section of your ingestion specification.
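By way of a hedged illustration (the bytes column and metric names here are hypothetical placeholders, not from any particular dataset), a minimal metricsSpec that counts rows and sums a numeric column might look like this:

```json
"metricsSpec": [
  { "type": "count", "name": "count" },
  { "type": "longSum", "name": "sum_bytes", "fieldName": "bytes" }
]
```

A count metric is worth emitting in any case: once the data is ingested, SUM("count") / COUNT(*) gives you the average number of raw rows collapsed into each stored row - your actual roll-up ratio.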
The only variable in the efficiency of roll-up - i.e. its ability to turn 10 rows into 1 row - is the same as it is for any GROUP BY operation: the combined cardinality of the dimensions you group on. In Druid's case, those are the source data columns that you list in the dimensionsSpec of your ingestion specification.
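For instance, if the only dimensions you keep are the dancer and dance columns used in the example later in this post, the dimensionsSpec is simply:

```json
"dimensionsSpec": {
  "dimensions": ["dancer", "dance"]
}
```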
A discrete piece of functionality in Druid is to automatically truncate incoming timestamps. That's done by specifying a queryGranularity in the granularitySpec of your ingestion spec. Here's an example data set where queryGranularity is set to FIVE_MINUTE: every event, instead of being ingested with its original timestamp, has had its timestamp truncated to the start of its five-minute bucket.
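The truncation in this example corresponds to a granularitySpec along these lines - a sketch only, and the segmentGranularity here is just an assumption; the roll-up flag discussed next lives in the same block:

```json
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": "FIVE_MINUTE"
}
```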
Now let's turn on roll-up, and imagine that we've added a metric of count (remember, you don't have to emit metrics) because - well, we can! Our queryGranularity-truncated events roll up well: each column has low cardinality within that time period.
Knowing the cardinality of your incoming data is essential to improving your roll-up ratio. Think about this example:
Notice that, within the five-minute buckets (our queryGranularity-truncated timestamp), every single event relates to a different dancer. It doesn't even matter that each person always dances the same dance (they must really love the fandango), or that a given dancer dances the same dance again later on. Our 8 incoming rows get stored as 8 rows because the GROUP BY is across all of the dimensions.
And there’s a second scenario: lots of combinations of values.
Here there are just too many combinations of values in a period of time. In each five-minute interval, every dancer dances a different dance. The roll-up just doesn't work - it is, after all, a GROUP BY across all of the dimensions.
One cause for this combined cardinality problem could be hierarchy. Let’s imagine that Peter is a specialist in the Fandango and Voguing. (Well done, Peter). John, meanwhile, is king of the Foxtrot, Waltz, and Paso Doble.
The roll-up ends up looking like this:
Here, the roll-up is less effective because each dancer (the root) knows a distinct set of dances (the leaves) and it’s very unlikely that they’d repeat the same dance in the same roll-up period.
You can look at data you have ingested already to get a feel for its profile.
You can find the number of rows in a one hour period simply by using the Druid console:
SELECT COUNT(*) AS rowCount FROM "your-dataset" WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
Finding cardinality is very easy as well:
SELECT COUNT(DISTINCT "your-column") AS columnCardinality FROM "your-dataset" WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
Of course, you can do more than one at once, but just be cautious - on large datasets this can swamp your cluster…
SELECT
  COUNT(DISTINCT "your-column-1") AS column1Cardinality,
  COUNT(DISTINCT "your-column-2") AS column2Cardinality,
  ...
FROM "your-dataset"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
Even more useful is the ratio of rows to unique values of a column. If you have the wikipedia edits sample data loaded, try this query, which gives you the ratio for just a few of the columns in the data set. (Notice that the
WHERE clause is absent.)
SELECT
  CAST(COUNT(DISTINCT cityName) AS FLOAT) / COUNT(*) AS cityNameRowratio,
  CAST(COUNT(DISTINCT channel) AS FLOAT) / COUNT(*) AS channelRowratio,
  CAST(COUNT(DISTINCT comment) AS FLOAT) / COUNT(*) AS commentRowratio,
  CAST(COUNT(DISTINCT countryIsoCode) AS FLOAT) / COUNT(*) AS countryIsoCodeRowratio,
  CAST(COUNT(DISTINCT countryName) AS FLOAT) / COUNT(*) AS countryNameRowratio,
  CAST(COUNT(DISTINCT diffUrl) AS FLOAT) / COUNT(*) AS diffUrlRowratio,
  CAST(COUNT(DISTINCT flags) AS FLOAT) / COUNT(*) AS flagsRowratio,
  CAST(COUNT(DISTINCT isAnonymous) AS FLOAT) / COUNT(*) AS isAnonymousRowratio,
  CAST(COUNT(DISTINCT isMinor) AS FLOAT) / COUNT(*) AS isMinorRowratio,
  CAST(COUNT(DISTINCT isNew) AS FLOAT) / COUNT(*) AS isNewRowratio,
  CAST(COUNT(DISTINCT isRobot) AS FLOAT) / COUNT(*) AS isRobotRowratio
FROM "wikipedia"
Those approaching 1 are the main cause of your low roll-up. In the wikipedia dataset, it's clearly the diffUrl column. At the other end of the scale are indicators of queries that are suffering because of the poor roll-up - like the wikipedia sample data columns that start with is.
The next step - working out whether your data has high compound cardinality - is more tricky. You can use a query like the one above, grouped by the roll-up period, to assess combinations of dimensions: say, the dancer and the dance together.
SELECT
  TIME_FLOOR("__time", 'PT5M') AS rollupPeriod,
  COUNT(*) AS rowCount,
  COUNT(DISTINCT "dancer" || '-' || "dance") AS combinedCardinality
FROM "your-dataset"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY 1
So what should you do?
Your options boil down to cardinality shrinking or dimension shrinking.
On the dimension-shrinking front, a common pattern is to store two copies of the data in Druid: one with all fields, and one with a commonly-used subset of fields designed to take advantage of rollup. Then, you can query the appropriate dataset for a particular use case.
On the cardinality-shrinking front, you can replace high-cardinality columns with a datasketch. Remember this?
If all we need is an approximate distinct count, or to do a set operation - say, for funnel analysis - we could leave the name column out of the dimensions and emit a datasketch instead.
In fact, you could go further, increase your queryGranularity to HOUR, and end up with just one row instead of hundreds - millions, even. This is particularly important for advertising and network data on columns like ip-address, when filtering, sorting, and using the data for aggregates is not required.
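As a sketch of what that could look like (it assumes the druid-datasketches extension is loaded, and the column and metric names are illustrative), you drop the high-cardinality column from the dimensions, emit a Theta sketch built from it instead, and coarsen the queryGranularity:

```json
"dimensionsSpec": {
  "dimensions": ["dancer", "dance"]
},
"metricsSpec": [
  { "type": "count", "name": "count" },
  { "type": "thetaSketch", "name": "name_sketch", "fieldName": "name" }
],
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": "HOUR",
  "rollup": true
}
```

At query time you then read the sketch back with approximate functions such as APPROX_COUNT_DISTINCT_DS_THETA, rather than filtering or sorting on the raw values.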
Or perhaps, again in cardinality-shrinking-land, it would be worth checking whether particular filters are always applied in your queries. You could either apply that in the
transformSpec to remove some cardinality, or you could create one dataset with all values, and another with highly-focused filters applied.
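As a hedged sketch of the first option, using the wikipedia columns from earlier (the filter is only an example - keep human edits and drop the rest):

```json
"transformSpec": {
  "filter": {
    "type": "selector",
    "dimension": "isRobot",
    "value": "false"
  }
}
```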
If you fancy scripting this kind of work, you can start with this query to get a list of the columns. You could paste the output into a spreadsheet to build a single query covering every column, so you don't have to do each one manually - or perhaps you fancy writing a Druid data profiler…!
SELECT "COLUMN_NAME" FROM INFORMATION_SCHEMA.COLUMNS WHERE "TABLE_NAME" = 'your-dataset'
When rolled-up data is used, there are substantial query performance and cost benefits - and at the volumes of data Druid often handles, a little up-front research like this can make a disproportionately humongous difference!
This is a community article! If you have suggestions and thoughts, Reply Now and let’s help make this resource better.