__time does not match the timestamp

Hello,

I'm trying to find the min timestamp for a set of data that I have. I found __time looking through the documentation, but it looks like the function returns incorrect values.

This is my query, grouping only by timestamp and taking min(__time), which should return the same value in both fields:

{
  "intervals" : [ "P1D/2015-05-09T13:00:00.000" ],
  "queryType" : "groupBy",
  "dataSource" : "impressions",
  "granularity" : "all",
  "dimensions" : [ "timestamp" ],
  "aggregations" : [ {
    "type" : "min",
    "name" : "startTime",
    "fieldName" : "__time"
  } ],
  "filter" : { … },
  "postAggregations" : [ ]
}

These are my results:

[{"event":{"startTime":1431176085504,"timestamp":"1431176129000"}},
{"event":{"startTime":1431176216576,"timestamp":"1431176164000"}},
{"event":{"startTime":1431176216576,"timestamp":"1431176173000"}},
{"event":{"startTime":1431176216576,"timestamp":"1431176179000"}},
{"event":{"startTime":1431176216576,"timestamp":"1431176232000"}},
{"event":{"startTime":1431176216576,"timestamp":"1431176233000"}},
{"event":{"startTime":1431176216576,"timestamp":"1431176239000"}},
{"event":{"startTime":1431176216576,"timestamp":"1431176254000"}},
{"event":{"startTime":1431176216576,"timestamp":"1431176278000"}},
{"event":{"startTime":1431176347648,"timestamp":"1431176288000"}},
{"event":{"startTime":1431176347648,"timestamp":"1431176304000"}},
{"event":{"startTime":1431176347648,"timestamp":"1431176352000"}}]

So, obviously, when I remove the timestamp dimension, I end up with a min __time of 1431176085504 instead of the expected 1431176129000.

The timestamps seem to be grouped into 3 different __time values, ending in …085504, …216576, and …347648 (almost as if they were bucketed that way).

Is there a reason for the difference between __time and timestamp? If so, can I do anything to get the correct timestamp in my min value?

If not, is there anywhere I can open a bug for this to be resolved?

Note: I'm using one of the latest versions of Druid (0.7).

Thanks

Simon

Hi Simon,

Are you trying to get the max and min timestamp of your data? If so, I suggest looking at the timeBoundary query (http://druid.io/docs/latest/TimeBoundaryQuery.html).
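A minimal timeBoundary query against your datasource would look something like this (a sketch based on those docs, using the datasource name from your groupBy query):

{
  "queryType" : "timeBoundary",
  "dataSource" : "impressions"
}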

If not, can you post the SQL query you are trying to issue?

Hey Fangjin,

I know about timeBoundary, but I'm trying to do something a little more complex. I simplified the query when I originally posted it and removed my grouping.

My query would look like this (note: I am not fluent in Druid SQL, so it might look like MySQL):

SELECT play-id, MIN(__time), MAX(__time)
FROM impressions
WHERE timestamp >= '2015-05-09'
  AND timestamp < '2015-05-10'
GROUP BY play-id

With the example I offered above, the timestamp for a play-id would be at '2015-05-09 00:00:10', but __time would give me a time like '2015-05-08 23:59:58', shifting it into the wrong day (same with the MAX, obviously).
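In native query terms, I believe the equivalent groupBy would be roughly this (a sketch; the endTime output name and the interval translation of the WHERE clause are mine):

{
  "queryType" : "groupBy",
  "dataSource" : "impressions",
  "intervals" : [ "2015-05-09/2015-05-10" ],
  "granularity" : "all",
  "dimensions" : [ "play-id" ],
  "aggregations" : [
    { "type" : "min", "name" : "startTime", "fieldName" : "__time" },
    { "type" : "max", "name" : "endTime", "fieldName" : "__time" }
  ]
}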

Hi Simon, at present, this functionality is not supported. The code would need to be added on your side.

Alright,

But could you please explain why the __time method returns a different long than the timestamp?

I can work around the missing functionality, but I need those two to match.

Simon, can you provide the spec you used to ingest the data?

--Eric

Hi, sorry about the delay, other issues took priority :stuck_out_tongue:

Here is the ingestion spec that we use (stripped down a bit for easier viewing); I've attached the full spec too, just in case:

{
  "dataSchema" : {
    "dataSource" : "impressions",
    "parser" : {
      "type" : "string",
      "parseSpec" : {
        "format" : "json",
        "timestampSpec" : {
          "column" : "timestamp",
          "format" : "millis"
        },
        "dimensionsSpec" : {
          "dimensions" : [
            "ad-id", "ad-type", "city", "country", "cookie", "duration",
            "latitude", "listener-count", "longitude", "play-id", "region",
            "timestamp", "uuid"
          ],
          "dimensionExclusions" : [],
          "spatialDimensions" : [ { "dimName" : "coordinates", "dims" : [ "latitude", "longitude" ] } ]
        }
      }
    },
    "metricsSpec" : [
      { "type" : "hyperUnique", "name" : "uuid-reach", "fieldName" : "uuid" },
      { "type" : "hyperUnique", "name" : "spot-count", "fieldName" : "play-id" }
    ],
    "granularitySpec" : {
      "type" : "uniform",
      "segmentGranularity" : "hour",
      "queryGranularity" : "none",
      "intervals" : [ "2015-06-01/P2D" ]
    }
  },
  "ioConfig" : {
    "type" : "hadoop",
    "inputSpec" : {
      "type" : "static",
      "paths" : "$sourceFiles"
    },
    "metadataUpdateSpec" : {
      "type" : "mysql",
      "connectURI" : "$druidSegmentsMysqlDbUri",
      "user" : "$druidSegmentsMysqlDbUser",
      "password" : "$druidSegmentsMysqlDbPassword",
      "segmentTable" : "$druidSegmentsMysqlDbTable"
    },
    "segmentOutputPath" : "$druidSegmentsOutputPath"
  },
  "tuningConfig" : {
    "type" : "hadoop",
    "workingPath" : "$hdfsUri/tmp",
    "partitionsSpec" : {
      "type" : "hashed",
      "numShards" : 4
    },
    "leaveIntermediate" : false,
    "cleanupOnFailure" : true,
    "overwriteFiles" : false,
    "ignoreInvalidRows" : false,
    "combineText" : false,
    "persistInHeap" : false,
    "ingestOffheap" : false,
    "bufferSize" : 134217728,
    "aggregationBufferRatio" : 0.5,
    "rowFlushBoundary" : 300000
  }
}

FullSpec.txt (4.33 KB)

The line

"queryGranularity" : "none",

is telling Druid not to roll up the data on any time boundary, so it
is actually storing data down to the millisecond. Thus, when you do
your query, you are getting the earliest timestamp that existed on
an event with the relevant "timestamp" column value. Set that to
"hour" and it'll start bunching things up a bit more like what you are
expecting.
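Concretely, that would be this one change in the granularitySpec you posted (everything else unchanged):

"granularitySpec" : {
  "type" : "uniform",
  "segmentGranularity" : "hour",
  "queryGranularity" : "hour",
  "intervals" : [ "2015-06-01/P2D" ]
}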

However, from these lines:

{"event":{"startTime":1431176216576,"timestamp":"1431176179000"}},
{"event":{"startTime":1431176216576,"timestamp":"1431176232000"}},
{"event":{"startTime":1431176216576,"timestamp":"1431176233000"}},
{"event":{"startTime":1431176216576,"timestamp":"1431176239000"}},
{"event":{"startTime":1431176216576,"timestamp":"1431176254000"}},

It looks like you had an event stamped 1431176233000 that came into
the system at time 1431176216576, so it might not line up exactly how
you are hoping.

One question: what are the semantics of your timestamp column?
Depending on what you are trying to do, it might make more sense to
give that to Druid as the time column instead of the other column you
are currently providing.

--Eric

Hey Eric,

I actually want to keep the granularity at none, as I sometimes want to know the exact second/minute when the data started coming in (based on filters).

So that's actually required (unless there is a "second" granularity, which could be used too).

The timestamp column is in millis format, but with the millis stripped to the nearest second (which is why they all end in 000), and I believe we are already sending it as the time column, as written in my spec:

"timestampSpec" : {
  "column" : "timestamp",
  "format" : "millis"
},

Unless I missed something and I need to specify the time another way.

I'm also not sure where the __time function is adding the extra millis (see 1431176216576).

Ah, ok, I know what's happening. The min aggregator appears to be
running the actual timestamp value through a "float" before storing it
as a double. This is due to legacy reasons.
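You can reproduce that rounding outside of Druid. A minimal Java sketch (class and variable names are just for illustration):

public class FloatTimeDemo {
    public static void main(String[] args) {
        long ts = 1431176129000L;      // the "timestamp" from the first result row above
        float f = (float) ts;          // what the "min" aggregator effectively does internally
        System.out.println((long) f);  // prints 1431176085504 -- the reported startTime
        // Floats near 1.43e12 are spaced 2^17 = 131072 apart, so
        // 1431176085504, 1431176216576, and 1431176347648 are consecutive
        // representable values -- exactly the three buckets in the results.
    }
}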

I forget exactly which version it was introduced in, but given that
you are using a relatively new version, try doing the query with a
"longMin" instead of "min". That should produce good results.

Assuming that is true, I would recommend removing the "timestamp"
column from the dimensions list. Also, you can set queryGranularity
to "second" and it would truncate to the second for you (though it
seems like you are already doing that).

--Eric

longMin is exactly what I needed; __time now equals timestamp.

I'll be testing queryGranularity "second" shortly too.

Thanks a lot Eric, you made my day :smiley: