How, and on what parameters, does Druid decide to roll up?

Hi All,

I want to understand what factors and parameters Druid uses to decide whether to roll up two entries. I assume Druid computes something like a hash to check whether two rows are equal, and then rolls them up accordingly. If so, how is that hash calculated?

I am sharing my use case with an example below.

I am doing a batch ingestion into Druid (versions druid-0.9.0 and 0.9.1.1).

The following is my batch ingestion spec:

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "MyEventListFile.json"
      }
    },
    "dataSchema": {
      "dataSource": "<<DATASOURCE_NAME>>>",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {
            "column": "timestamp",
            "format": "millis"
          },
          "dimensionsSpec": {
            "dimensions": ["A", "B", "C", "D"],
            "dimensionExclusions" :
          }
        }
      },
      "metricsSpec": [
        { "type": "longMax", "name": "eventCount", "fieldName": "count" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "none",
        "intervals": ["2019-07-18/2019-12-01"]
      }
    },
    "tuningConfig": {
      "type": "hadoop"
    }
  }
}

{"timestamp":"1572557460000","A":"0","B":"Pravesh300000","C":"30000","D":"praveshgmail.com","NOTADIMENSION":"test1","count":1}

{"timestamp":"1572557460000","A":"0","B":"Pravesh300000","C":"30000","D":"praveshgmail.com","NOTADIMENSION":"test2","count":1}

I have two events with the same timestamp and the same values for the dimensions (A, B, C, D), BUT the column that I cannot declare as a dimension (NOTADIMENSION) is different.

In this case I am getting a count of 1, but I want a count of 2.

What is the solution here? Is there anything I can specify explicitly to tell Druid how to roll up, i.e. which columns and timestamp to calculate the hash on?
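For what it is worth, the behaviour described in this post can be reproduced with a small model. This is only a sketch of the idea, not Druid's actual implementation: it assumes rollup groups rows by the timestamp plus the declared dimensions A, B, C, D, ignores any column that is not a dimension, and combines the metric with the configured aggregator. The helper `roll_up` is a hypothetical name. With longMax over "count", two rows with count 1 merge into one row whose value is max(1, 1) = 1, while a sum-style aggregator gives 2.

```python
DIMENSIONS = ["A", "B", "C", "D"]  # dimensions declared in the ingestion spec

events = [
    {"timestamp": "1572557460000", "A": "0", "B": "Pravesh300000", "C": "30000",
     "D": "praveshgmail.com", "NOTADIMENSION": "test1", "count": 1},
    {"timestamp": "1572557460000", "A": "0", "B": "Pravesh300000", "C": "30000",
     "D": "praveshgmail.com", "NOTADIMENSION": "test2", "count": 1},
]

def roll_up(rows, combine):
    """Hypothetical model of rollup: merge rows sharing (timestamp, dimensions)."""
    merged = {}
    for row in rows:
        # NOTADIMENSION is not part of the key, so it cannot keep rows apart.
        key = (int(row["timestamp"]),) + tuple(row[d] for d in DIMENSIONS)
        if key in merged:
            merged[key] = combine(merged[key], row["count"])
        else:
            merged[key] = row["count"]
    return merged

long_max = roll_up(events, max)                  # models "longMax"
long_sum = roll_up(events, lambda a, b: a + b)   # models "longSum"

print(list(long_max.values()))  # [1]
print(list(long_sum.values()))  # [2]
```

If this model matches what is happening, the change to try would be in metricsSpec, for example { "type": "longSum", "name": "eventCount", "fieldName": "count" }, rather than in how rows are compared.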

Hoping to hear back soon, as I am blocked on this.

Thanks,

Pravesh Gupta

It is based on "queryGranularity".

But in your example it is set to none, so most likely you will not get any rollup. Change it to one of the other granularities.
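As a rough illustration of how "queryGranularity" enters the picture (a sketch, not Druid code): the timestamp is truncated to the query granularity before rows are compared. The assumption made here is that "none" keeps full millisecond precision, so rows can still merge when their millisecond timestamps and dimension values are identical. The function `truncate`, the `BUCKET_MS` table, and the second timestamp are hypothetical and exist only for this example.

```python
# Bucket sizes in milliseconds; "none" is modelled as 1 ms (no truncation).
BUCKET_MS = {"none": 1, "second": 1_000, "minute": 60_000, "hour": 3_600_000}

def truncate(timestamp_ms, query_granularity):
    """Floor a millisecond timestamp to the start of its granularity bucket."""
    size = BUCKET_MS[query_granularity]
    return timestamp_ms - (timestamp_ms % size)

t1 = 1572557460000   # timestamp from the example events
t2 = 1572557461500   # hypothetical event 1.5 seconds later

# Identical millisecond timestamps still match under "none".
same_none = truncate(t1, "none") == truncate(t1, "none")
# Different millisecond timestamps stay distinct under "none".
diff_none = truncate(t1, "none") == truncate(t2, "none")
# Under "minute" both timestamps fall into the same bucket.
same_minute = truncate(t1, "minute") == truncate(t2, "minute")

print(same_none, diff_none, same_minute)  # True False True
```

Under this model, a coarser granularity makes more rows eligible for rollup, and "none" only merges rows whose timestamps are equal to the millisecond.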

Then why is rollup happening in my case?

However, can I tell Druid which set of columns to consider when checking equality? What would my queryGranularity look like in that case?

Is it possible to tell Druid: if my dimensions and one particular non-dimension column are the same, do the rollup, otherwise don't?

One more follow-up question: does the timestamp come into the picture in Druid rollup?

Unfortunately, I couldn't find many details online about how Druid does rollup.

Thanks,
Pravesh Gupta

Thanks for the detailed answer.

One of my use cases is this: two users click on the same link at exactly the same time, so the events for these two users have the same dimensions and the same timestamp. The only thing that differentiates them is a unique id, which I am ingesting into Druid but not as a dimension column (for some reason, let's say). In this case, how do I ensure there is no rollup? I want my event count to be 2.

I hope I have made myself clearer.
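One way to picture the options for this use case, as a sketch under the assumption that rollup compares only the timestamp and the declared dimensions: if the unique id is not part of the key, the two click events are indistinguishable and merge, so the event count has to come from a sum or count style metric. If the id is included in the key, the rows stay separate. The field names `link` and `uid` and the helper `group_rows` are made up for this example.

```python
from collections import defaultdict

clicks = [
    {"timestamp": 1572557460000, "link": "home", "uid": "u-1", "count": 1},
    {"timestamp": 1572557460000, "link": "home", "uid": "u-2", "count": 1},
]

def group_rows(rows, key_columns):
    """Group rows by timestamp plus key_columns and sum their counts."""
    totals = defaultdict(int)
    for row in rows:
        key = (row["timestamp"],) + tuple(row[c] for c in key_columns)
        totals[key] += row["count"]
    return dict(totals)

without_uid = group_rows(clicks, ["link"])         # uid ignored: one merged row
with_uid = group_rows(clicks, ["link", "uid"])     # uid in the key: two rows

print(len(without_uid), sum(without_uid.values()))  # 1 2
print(len(with_uid), sum(with_uid.values()))        # 2 2
```

Either way the total is 2 in this model; the difference is whether the two events survive as separate rows or as one row with a summed count.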

Thanks for the answer.
Actually, I am still confused about how Druid actually decides to roll up.

Basically, which columns does it consider when deciding that two rows have to be rolled up?

I am guessing it considers all the dimension columns and the timestamp, and nothing else.

I do have a case where a column is neither a dimension, nor a metric, nor the timestamp column, but I do want its value to be considered when Druid decides whether to roll up rows. Is that even possible in the first place? Let's not discuss whether it makes sense or has a proper use case.

Please help here.

Hi,

Thanks for the answer.
Actually, I am still confused about how Druid actually decides to roll up.

Basically, which columns does it consider when deciding that two rows have to be rolled up?

I am guessing it considers all the dimension columns and the timestamp, and nothing else.

Yes, dimensions and timestamp. Please look at this example, it will give you a better idea: http://druid.io/blog/2013/09/12/the-art-of-approximating-distributions.html

I do have a case where a column is neither a dimension, nor a metric, nor the timestamp column,

What is the nature of this column? Is it a projection from other columns?

but I do want its value to be considered when Druid decides whether to roll up rows. Is that even possible in the first place? Let's not discuss whether it makes sense or has a proper use case.

Well, is this dimension column part of the ingested data? If so, it will be part of the decision.

If you give me more examples of your use case I can answer this question, but to be honest I am still not getting it, sorry :(

Hi All,

We are working on something that requires an understanding of rollup.

We have a use case where we want to roll up based only on the timestamp and one particular dimension, not all dimensions. Is this possible?

My data source has many dimensions and metrics, but I want rollup to happen based on only one dimension value and, of course, the timestamp.

Does this have anything to do with the Druid version? We are currently running 0.9.1.1 in production, which I guess does not support this, as Slim confirmed in the mail above.

Hi Pravesh,

Can you elaborate on how you want to interpret/implement rollup for the other dimensions when rolling up rows based on only one dimension value?

Hi Nishant,
We have the following requirement.

Assume these are the two events received by Druid:

Event 1 is received first and then Event 2. Both events have the same value for rollUpDim ("GUID"), and we want to roll up on this dimension.

Event 1: {timestamp: 123, rollUpDim: "GUID", dim1: "111", dim2: "444", dim3: "666", metricCount: 1}

Event 2: {timestamp: 128, rollUpDim: "GUID", dim1: "222", dim2: "444", dim4: "777", metricCount: 1}

This is how we want them rolled up (strategy: LongMax):

{timestamp: 128, rollUpDim: "GUID", dim1: "222", dim2: "444", dim4: "777", dim3: "666", metricCount: 1}

In effect, we merged these rows.
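The merge described here could be sketched as a pre-processing step that runs before the data reaches Druid. That placement is an assumption made for illustration, not a Druid feature: rows are keyed on rollUpDim alone, the latest timestamp wins, later dimension values overwrite earlier ones while dimensions missing from the later event are kept, and metricCount is combined with max (the LongMax strategy). The function name `merge_by_key` is hypothetical.

```python
def merge_by_key(rows, key="rollUpDim", metric="metricCount"):
    """Merge rows sharing `key`; rows are assumed to arrive in time order."""
    merged = {}
    for row in rows:
        current = merged.get(row[key])
        if current is None:
            merged[row[key]] = dict(row)   # copy so the input is not mutated
            continue
        max_metric = max(current[metric], row[metric])   # LongMax strategy
        max_ts = max(current["timestamp"], row["timestamp"])
        current.update(row)   # later dimension values overwrite earlier ones
        current[metric] = max_metric
        current["timestamp"] = max_ts
    return list(merged.values())

event1 = {"timestamp": 123, "rollUpDim": "GUID", "dim1": "111",
          "dim2": "444", "dim3": "666", "metricCount": 1}
event2 = {"timestamp": 128, "rollUpDim": "GUID", "dim1": "222",
          "dim2": "444", "dim4": "777", "metricCount": 1}

result = merge_by_key([event1, event2])
print(result)
```

For the two events above this yields a single row with timestamp 128, dim1 "222", dim2 "444", dim3 "666", dim4 "777" and metricCount 1, which matches the desired merged row.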

Hope it is clear.

We are also worried about whether this approach is possible with windowless ingestion in the Kafka Indexing Service.

In this case, aren't dim1, dim2, dim3 and dim4 metrics that you roll up using LongMax? Or do you want to filter on them later in your queries?

Yes, we want to filter on these in our queries.

We want the rollup decision to be based only on the rollUpDim column.