Druid counts differ when we run the same query on daily and raw data

Hi,

  1. When I run the following query against the ABS data source:

{
  "queryType" : "groupBy",
  "dataSource" : "ABS",
  "granularity" : "all",
  "intervals" : [ "2018-07-12T00:00:00.000Z/2018-07-13T00:00:00.000Z" ],
  "descending" : "false",
  "aggregations" : [ {
    "type" : "count",
    "name" : "COUNT",
    "fieldName" : "COUNT"
  } ],
  "postAggregations" : [ ],
  "dimensions" : [ "event_id" ]
}

  2. The JSON below is used to submit the daily job to Druid, which creates the segments for ABS_DAILY for a specific interval:

{
  "spec": {
    "ioConfig": {
      "firehose": {
        "dataSource": "ABS",
        "interval": "2018-07-12T00:00:00.000Z/2018-07-13T00:00:00.000Z",
        "metrics": null,
        "dimensions": null,
        "type": "ingestSegment"
      },
      "type": "index"
    },
    "dataSchema": {
      "granularitySpec": {
        "queryGranularity": "day",
        "intervals": [
          "2018-07-12T00:00:00.000Z/2018-07-13T00:00:00.000Z"
        ],
        "segmentGranularity": "day",
        "type": "uniform"
      },
      "dataSource": "ABS_DAILY",
      "metricsSpec": [],
      "parser": {
        "parseSpec": {
          "timestampSpec": {
            "column": "server_timestamp",
            "format": "dd MMMM, yyyy (HH:mm:ss)"
          },
          "dimensionsSpec": {
            "dimensionExclusions": [
              "server_timestamp"
            ],
            "dimensions": []
          },
          "format": "json"
        },
        "type": "string"
      }
    }
  },
  "type": "index"
}

  3. When I query ERS_DAILY with the query below, it returns a different result:

{
  "queryType" : "groupBy",
  "dataSource" : "ERS_DAILY",
  "granularity" : "all",
  "intervals" : [ "2018-07-12T00:00:00.000Z/2018-07-13T00:00:00.000Z" ],
  "descending" : "false",
  "aggregations" : [ {
    "type" : "count",
    "name" : "COUNT",
    "fieldName" : "COUNT"
  } ],
  "postAggregations" : [ ],
  "dimensions" : [ "event_id" ]
}

Why do the result counts differ? The count mismatch is causing a big issue for us.

Regards,

Sudhanshu Lenka
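To narrow down a mismatch like this, it can help to compare the two groupBy responses per event_id instead of looking only at totals. The following is a minimal sketch, assuming the usual groupBy response shape (a list of objects, each carrying an "event" map). The helper names counts_by_event_id and diff_counts and the sample values are made up for illustration and are not from the thread.

```python
from typing import Dict, List, Tuple


def counts_by_event_id(results: List[dict], metric: str = "COUNT") -> Dict[str, int]:
    """Flatten a groupBy response into {event_id: count}."""
    return {row["event"]["event_id"]: row["event"][metric] for row in results}


def diff_counts(raw: List[dict], daily: List[dict]) -> Dict[str, Tuple[int, int]]:
    """Return {event_id: (raw_count, daily_count)} for every event_id whose counts differ."""
    a = counts_by_event_id(raw)
    b = counts_by_event_id(daily)
    keys = sorted(set(a) | set(b), key=str)
    return {k: (a.get(k, 0), b.get(k, 0)) for k in keys if a.get(k, 0) != b.get(k, 0)}


# Hypothetical sample responses (values invented for the example).
raw_results = [
    {"version": "v1", "timestamp": "2018-07-12T00:00:00.000Z",
     "event": {"event_id": "login", "COUNT": 5}},
    {"version": "v1", "timestamp": "2018-07-12T00:00:00.000Z",
     "event": {"event_id": "logout", "COUNT": 2}},
]
daily_results = [
    {"version": "v1", "timestamp": "2018-07-12T00:00:00.000Z",
     "event": {"event_id": "login", "COUNT": 3}},
    {"version": "v1", "timestamp": "2018-07-12T00:00:00.000Z",
     "event": {"event_id": "logout", "COUNT": 2}},
]

mismatches = diff_counts(raw_results, daily_results)
print(mismatches)
```

Knowing which event_id values diverge (all of them, or only a few) usually says more about the cause than the overall difference does.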

Are you sure your source JSON files never changed? And had your daily batch task already finished when you ran the query? I think that if the ingestion job has finished, it is impossible to get a different result.

Sudhanshu Lenka sudhanshu.lenka2008@gmail.com wrote on Friday, July 13, 2018 at 11:11 PM:

Hi Frank,

Thanks for your reply.

Case 1: When we run the query, we are 100% sure the ingestion job has completed successfully.

Case 2: We do not usually change the source JSON files.

I have also seen another case of differing counts, between ABS_DAILY and ABS_MONTHLY, where ABS_MONTHLY is built from ABS_DAILY.

Is it possible that rows with the same timestamp and the same data (duplicates that arise while we consume from Kafka) are replaced by Druid when we run the daily or monthly job?

How is it possible that the ABS_DAILY count is lower and the ABS_MONTHLY count is higher, with a difference of around 1 crore (10 million)?

Is there anything we are missing when submitting the daily or monthly job?

Regards,

Sudhanshu Lenka
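One thing that may be worth checking here (an assumption about this setup, not something confirmed in the thread): with queryGranularity set to day and rollup left enabled, Druid truncates timestamps to the day and merges rows whose dimension values are identical, so the number of stored rows can be lower than the number of raw events. A rough Python simulation of that merging follows; the rollup function and the sample rows are hypothetical and only imitate the behaviour.

```python
from collections import Counter
from datetime import datetime


def truncate_to_day(ts: str) -> str:
    """Truncate an ISO-like timestamp to its day, mimicking queryGranularity "day"."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S").strftime("%Y-%m-%d")


def rollup(rows):
    """Merge rows that share (day, dimension values).

    Returns a Counter keyed by (day, dims); each key stands for one stored row,
    and its value is how many raw rows were merged into it.
    """
    merged = Counter()
    for ts, dims in rows:
        merged[(truncate_to_day(ts), dims)] += 1
    return merged


# Hypothetical raw events: (timestamp, (event_id,))
raw_rows = [
    ("2018-07-12T10:00:00", ("login",)),
    ("2018-07-12T10:00:00", ("login",)),   # exact duplicate, e.g. re-consumed from Kafka
    ("2018-07-12T18:30:00", ("login",)),   # same day, same dimensions
    ("2018-07-12T11:00:00", ("logout",)),
]

stored = rollup(raw_rows)
raw_row_count = len(raw_rows)
stored_row_count = len(stored)
print(raw_row_count, stored_row_count)  # 4 2
```

In this toy case four raw events become two stored rows, so a plain row count on the rolled-up data would report 2 instead of 4.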

Hi Frank,

I have another question.

Is it correct to run a "count" query on ABS_DAILY and ABS_MONTHLY? We do not have any metric fields, so we use the count aggregator on ABS_DAILY and ABS_MONTHLY. Or do we need a dummy metric to get the right record count?

Regards,

Sudhanshu Lenka
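On the count question: as far as I know, the count aggregator at query time returns the number of stored Druid rows, not the number of originally ingested events. The commonly documented approach is to add a count metric at ingestion time and query it with longSum, and, when re-indexing from a datasource that already has such a metric, to use longSum over it in the metricsSpec. The sketch below simulates the difference in plain Python. The function names and sample rows are hypothetical, and the metric is modelled as a simple integer column.

```python
from collections import defaultdict


def rollup_with_count(rows, bucket):
    """Roll up (timestamp, event_id, count) rows into coarser time buckets.

    `bucket` truncates a timestamp string; counts of merged rows are summed,
    which is what a longSum over a count metric would do on re-indexing.
    """
    out = defaultdict(int)
    for ts, event_id, count in rows:
        out[(bucket(ts), event_id)] += count
    return [(ts, eid, c) for (ts, eid), c in sorted(out.items())]


def count_agg(rows):
    """Imitates the query-time count aggregator: number of stored rows."""
    return len(rows)


def long_sum(rows):
    """Imitates a query-time longSum over the stored count metric."""
    return sum(c for _, _, c in rows)


# Hypothetical raw events, each carrying count = 1.
raw = [
    ("2018-07-12T10:00:00", "login", 1),
    ("2018-07-12T18:30:00", "login", 1),
    ("2018-07-13T09:00:00", "login", 1),
    ("2018-07-13T09:05:00", "logout", 1),
]
daily = rollup_with_count(raw, lambda ts: ts[:10])     # truncate to YYYY-MM-DD
monthly = rollup_with_count(daily, lambda ts: ts[:7])  # truncate to YYYY-MM

for name, rows in (("raw", raw), ("daily", daily), ("monthly", monthly)):
    print(name, count_agg(rows), long_sum(rows))
```

Here the row count shrinks from 4 to 3 to 2 as the data is rolled up, while the summed count metric stays at 4 at every level, which is why a stored count metric is usually preferred over a dummy metric or a bare row count.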