experiment with duplicate rows

Hi,
I ingested a file containing 1000 lines, all of them the same duplicated row.

header:

uniqueid, timestamp, appid, geocode, dimension_count

A sample of 4 lines:

191886,1430872823,886,BBB,1
191886,1430872823,886,BBB,1
191886,1430872823,886,BBB,1
191886,1430872823,886,BBB,1

I then run a groupBy query and I get this result:

[ {
  "version" : "v1",
  "timestamp" : "2015-05-01T00:00:00.000Z",
  "event" : {
    "count" : 1,
    "geo_country" : "BBB"
  }
} ]

This result suggests that only one row was consumed.

I then run a select query, and I get this:

[ {
  "timestamp" : "2015-05-06T00:00:00.000Z",
  "result" : {
    "pagingIdentifiers" : {
      "events2_2015-05-06T00:00:00.000Z_2015-05-07T00:00:00.000Z_2015-06-02T11:52:05.138Z" : 0
    },
    "events" : [ {
      "segmentId" : "events2_2015-05-06T00:00:00.000Z_2015-05-07T00:00:00.000Z_2015-06-02T11:52:05.138Z",
      "offset" : 0,
      "event" : {
        "timestamp" : "2015-05-06T00:40:23.000Z",
        "unqid" : "191886",
        "app_id" : "886",
        "geo_country" : "BBB",
        "count" : 1000.0,
        "dimension_count" : 1000.0
      }
    } ]
  }
} ]

I am not sure how to interpret this. It seems that there is only one row, but count and dimension_count have each been incremented 1000 times.

Can you explain this behavior to me?

I’m trying to understand how Druid deals with duplicated row data.

Thanks.

Johnny

While indexing, Druid merges rows together if they have the same combination of timestamp (truncated to the specified granularity) and dimension values. Metrics are "aggregated" according to the given aggregation type.

Since all your rows were duplicates, they got merged into a single row when indexed, and dimension_count kept getting summed/aggregated, which is why it shows 1000.
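To make the merging concrete, here is a rough Python sketch of the idea. It is not Druid code. The function name rollup and the granularity_seconds parameter are made up for illustration, and it assumes an ingest-time count metric plus a sum aggregator on dimension_count, which is what your select output suggests.

```python
from collections import defaultdict

def rollup(rows, granularity_seconds=1):
    """Illustrative only: merge rows that share (truncated timestamp, dimensions).

    Each row is (uniqueid, epoch_seconds, appid, geocode, dimension_count).
    Assumed metrics: "count" adds 1 per ingested row, "dimension_count" is summed.
    """
    merged = defaultdict(lambda: {"count": 0, "dimension_count": 0})
    for uniqueid, ts, appid, geocode, dimension_count in rows:
        # Truncate the timestamp to the chosen granularity.
        bucket = ts - ts % granularity_seconds
        # Rows with the same truncated timestamp and dimension values share a key.
        key = (bucket, uniqueid, appid, geocode)
        merged[key]["count"] += 1
        merged[key]["dimension_count"] += dimension_count
    return dict(merged)

# 1000 copies of the row from the question.
rows = [("191886", 1430872823, "886", "BBB", 1)] * 1000
result = rollup(rows)

for key, metrics in result.items():
    print(key, metrics)
```

With 1000 identical input rows, the sketch ends up with one stored row whose count and dimension_count are both 1000, which matches what the select query returned.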

See http://druid.io/docs/0.7.3/ for more details, and specifically the Druid white paper.

– Himanshu