Difference between duration and period query granularities

Hey guys, can someone help explain the difference between a duration and a period granularity? Using the Wikipedia dataset from the tutorial plus a few additional data points from 2014 and 2015, I'm executing two queries that I thought should return identical results, but the timestamps are not what I expect. What I'm really looking for is a general way, given two arbitrary time bounds, to break that interval up into N equally sized buckets.
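(To make that concrete, here's a rough sketch in Python of the kind of thing I mean; the helper here is just illustrative, not anything from Druid, and the idea would be to feed the result into a duration granularity whose origin is the interval start.)

import math
from datetime import datetime, timezone

def bucket_size_ms(start, end, n):
    # Split [start, end) into n equally sized buckets and return the bucket
    # size in milliseconds, suitable for a duration granularity with
    # origin = start.
    total_ms = int((end - start).total_seconds() * 1000)
    return math.ceil(total_ms / n)

# e.g. 46 equal buckets over the interval used in the queries below
start = datetime(1970, 1, 1, tzinfo=timezone.utc)
end = datetime(2016, 1, 1, tzinfo=timezone.utc)
print(bucket_size_ms(start, end, 46))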

Here’s the dataset:

{"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", "language" : "en", "user" : "nuclear", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143}

{"timestamp": "2013-08-31T03:32:45Z", "page": "Striker Eureka", "language" : "en", "user" : "speed", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Australia", "country":"Australia", "region":"Cantebury", "city":"Syndey", "added": 459, "deleted": 129, "delta": 330}

{"timestamp": "2013-08-31T07:11:21Z", "page": "Cherno Alpha", "language" : "ru", "user" : "masterYi", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"article", "continent":"Asia", "country":"Russia", "region":"Oblast", "city":"Moscow", "added": 123, "deleted": 12, "delta": 111}

{"timestamp": "2013-08-31T11:58:39Z", "page": "Crimson Typhoon", "language" : "zh", "user" : "triplets", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"China", "region":"Shanxi", "city":"Taiyuan", "added": 905, "deleted": 5, "delta": 900}

{"timestamp": "2013-08-31T12:41:27Z", "page": "Coyote Tango", "language" : "ja", "user" : "cancer", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"Japan", "region":"Kanto", "city":"Tokyo", "added": 1, "deleted": 10, "delta": -9}

{"timestamp": "2013-08-31T12:42:27Z", "page": "J", "language" : "en", "user" : "shodan", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"North America", "country":"US", "region":"The South", "city":["Houston", "Tokyo"], "added": 1, "deleted": 10, "delta": -9}

{"timestamp": "2013-08-31T13:42:27Z", "page": "J", "language" : "en", "user" : "shodan", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"North America", "country":"US", "region":"The South", "city":["Phoenix", "Tokyo"], "added": 1, "deleted": 10, "delta": -9}

{"timestamp": "2014-07-15T08:45:27Z", "page": "J1", "language" : "en", "user" : "shodan", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"North America", "country":"US", "region":"The South", "city":["Phoenix", "Tokyo"], "added": 1, "deleted": 10, "delta": -23}

{"timestamp": "2015-02-08T10:41:27Z", "page": "J2", "language" : "en", "user" : "shodan", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"North America", "country":"US", "region":"The South", "city":["Phoenix", "Tokyo"], "added": 1, "deleted": 10, "delta": -42}

Here's my query using a period granularity, rolling up results into one-year chunks since the Unix epoch (let's say I don't know when my data was created and I just want to find everything possible):

{
  "queryType": "timeseries",
  "dataSource": "wikipedia",
  "intervals": [
    "1970-01-01T00:00:00.000Z/2016-01-01T00:00:00.000Z"
  ],
  "granularity": {
    "type": "period",
    "period": "P1Y",
    "origin": "1970-01-01T00:00:00.000Z"
  },
  "aggregations": [
    {
      "name": "count",
      "type": "count"
    },
    {
      "name": "delta",
      "type": "doubleSum",
      "fieldName": "delta"
    }
  ]
}

Results in:

[ {
  "timestamp" : "2013-01-01T00:00:00.000Z",
  "result" : {
    "count" : 7,
    "delta" : 1171.0
  }
}, {
  "timestamp" : "2014-01-01T00:00:00.000Z",
  "result" : {
    "count" : 1,
    "delta" : -23.0
  }
}, {
  "timestamp" : "2015-01-01T00:00:00.000Z",
  "result" : {
    "count" : 1,
    "delta" : -42.0
  }
} ]

Now for the duration query:

{
  "queryType": "timeseries",
  "dataSource": "wikipedia",
  "intervals": [
    "1970-01-01T00:00:00.000Z/2016-01-01T00:00:00.000Z"
  ],
  "granularity": {
    "type": "duration",
    "duration": 31536000000,
    "origin": "1970-01-01T00:00:00.000Z"
  },
  "aggregations": [
    {
      "name": "count",
      "type": "count"
    },
    {
      "name": "delta",
      "type": "doubleSum",
      "fieldName": "delta"
    }
  ]
}

Results in:

[ {
  "timestamp" : "2012-12-21T00:00:00.000Z",
  "result" : {
    "count" : 7,
    "delta" : 1171.0
  }
}, {
  "timestamp" : "2013-12-21T00:00:00.000Z",
  "result" : {
    "count" : 1,
    "delta" : -23.0
  }
}, {
  "timestamp" : "2014-12-21T00:00:00.000Z",
  "result" : {
    "count" : 1,
    "delta" : -42.0
  }
} ]

I'm assuming the difference in results has something to do with using milliseconds for the duration, which doesn't account for leap days. However, I'm still confused as to why the timestamps I see start in 2012. Thanks!

–T

Hi T,

Yes, you are correct: the difference comes from using milliseconds. By specifying the number of milliseconds in a 365-day year, you are assuming that every year is 365 days long, but there have been 11 leap years since 1970. A duration granularity does all of its work in milliseconds: it truncates each timestamp down to the nearest multiple of the duration after the origin, with no calendar awareness. Because 43 buckets of 365 days come up 11 days short of the 43 calendar years between 1970-01-01 and 2013-01-01, the bucket holding your 2013 events starts at 2012-12-21 instead of 2013-01-01 (the corresponding year minus 11 days). A period granularity uses calendar arithmetic instead, so "P1Y" buckets always line up with January 1st.
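If it helps, here is a rough sketch in Python of the truncation arithmetic as I understand it; the helper name is just illustrative, not a Druid API, and the constants are taken from the duration query above:

from datetime import datetime, timedelta, timezone

ORIGIN = datetime(1970, 1, 1, tzinfo=timezone.utc)
DURATION_MS = 31536000000  # 365 days, the value from the duration query above

def duration_bucket_start(ts):
    # Floor the timestamp to the nearest multiple of the duration after the
    # origin, working entirely in milliseconds (no calendar awareness).
    elapsed_ms = int((ts - ORIGIN).total_seconds() * 1000)
    bucket_ms = (elapsed_ms // DURATION_MS) * DURATION_MS
    return ORIGIN + timedelta(milliseconds=bucket_ms)

# The 2013-08-31 rows land in bucket 43; 43 * 365 days after 1970-01-01 is
# 2012-12-21, i.e. 11 days (one per leap year since 1970) before 2013-01-01.
print(duration_bucket_start(datetime(2013, 8, 31, 1, 2, 33, tzinfo=timezone.utc)))
# 2012-12-21 00:00:00+00:00

That also suggests an answer to your original question: for N equal buckets over an arbitrary interval, a duration granularity with the duration set to the interval length divided by N and the origin set to the interval start should do it, with the caveat that the bucket edges will ignore the calendar just as above.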

-Parag

Ah, that makes sense, thanks!
–T