row count too close to rollup count

Hey,

I was manually running some queries against our Druid cluster, and I noticed something interesting about our data. I executed this query:

{
  "queryType": "timeseries",
  "aggregations": [
    {
      "type": "count",
      "name": "c"
    }, {
      "type": "longSum",
      "name": "s",
      "fieldName": "loglines"
    }
  ],
  "granularity": "day"
}

where loglines is the name of the rollup count, defined by

"metricsSpec": [
  { "type": "count", "name": "loglines" }
]

in our ingestion specs.

In the result, s was only slightly larger than c (about 1-2% more, depending on the day). I interpreted this as a sign of insufficient rollup. Is this normal, or does it mean that our dimensions have too much cardinality?
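
For reference, the s/c ratio (the average number of raw log lines behind each stored row) can be computed inside the same query with an arithmetic post-aggregator. This is only a sketch: the dataSource name "logs", the interval, and the output name "rows_per_stored_row" are placeholders, not values from our cluster.

```json
{
  "queryType": "timeseries",
  "dataSource": "logs",
  "intervals": ["2016-09-01/2016-10-01"],
  "granularity": "day",
  "aggregations": [
    { "type": "count", "name": "c" },
    { "type": "longSum", "name": "s", "fieldName": "loglines" }
  ],
  "postAggregations": [
    {
      "type": "arithmetic",
      "name": "rows_per_stored_row",
      "fn": "/",
      "fields": [
        { "type": "fieldAccess", "fieldName": "s" },
        { "type": "fieldAccess", "fieldName": "c" }
      ]
    }
  ]
}
```

A value close to 1 means that almost every ingested event ended up as its own stored row.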

Best,

Balazs

Hey Balazs,

Whether or not you have ‘insufficient rollup’ depends on what you are doing with your data, i.e. what kinds of queries you’re issuing. Some use cases require retaining high-cardinality dimensions and small query granularities, in which case low or no rollup is expected. For other use cases, such as when you’re only concerned with the number of unique entries in a high-cardinality dimension, you can attain better rollup by using approximation aggregators such as hyperUnique or theta sketches (http://druid.io/docs/0.9.1.1/development/extensions-core/datasketches-aggregators.html). A good starting point for optimizing rollup is to take a look at your ingestion dataSchema and ask: a) what is the minimum query granularity I need, b) are there any columns in the raw data that I never use and can exclude, and c) are there any dimensions that can be modeled using an approximation algorithm?
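
As a rough illustration of those three points, a dataSchema fragment might look something like the sketch below. All names here are hypothetical (the dataSource "logs", the timestamp column "timestamp", the dimensions "host" and "status", and the "user_id" column), and the layout assumes a 0.9.x-style spec where dimensionsSpec sits inside the parser's parseSpec, so adapt it to your own spec rather than copying it. It uses a coarser queryGranularity (a), lists only the dimensions that are actually queried so that other raw columns are dropped (b), and replaces a high-cardinality column with a hyperUnique metric (c).

```json
{
  "dataSchema": {
    "dataSource": "logs",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": { "column": "timestamp", "format": "auto" },
        "dimensionsSpec": {
          "dimensions": ["host", "status"]
        }
      }
    },
    "metricsSpec": [
      { "type": "count", "name": "loglines" },
      { "type": "hyperUnique", "name": "unique_users", "fieldName": "user_id" }
    ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "day",
      "queryGranularity": "minute"
    }
  }
}
```

With a spec along these lines, "user_id" is no longer a dimension, so it stops splitting rows during rollup, while its approximate distinct count remains queryable through the hyperUnique metric.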

Hi Lim,

Thanks for the insights.

As a start, I changed the query granularity from second to
minute, but although Pivot seems to run more smoothly, the change
only moderately affected the sum/count ratio (it increased from
~1.015 to ~1.075 on average).

Currently we don’t have unique identifiers as dimensions, but it
looks like the current columns characterize our data too finely,
so we might exclude some of them.

Thanks again,

Balazs