POSIX vs ISO timestamps

Hi,
I ran some experiments with batch indexing.

I successfully loaded 1000 rows of events with a batch index task.

I then changed one of the fields, ran it again, and the data appeared to update just fine.

Then I changed the timestamp spec from "iso" to "posix" and tried to overwrite the data again.

It looks like 1000 rows were written with the new (posix) data, but 12 rows of the old (iso) data are still there.

I know this is an edge case, but is there an issue with "posix" timestamps?

Or with mixing timestamp formats?

Is ISO 8601 the only timestamp format that is "guaranteed" to work?

Here is my task JSON:

{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "events3",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "csv",
          "timestampSpec": {
            "column": "timestamp",
            "format": "posix"
          },
          "dimensionsSpec": {
            "dimensions": ["unqid", "app_id", "geo_country"],
            "dimensionExclusions": [],
            "spatialDimensions": []
          },
          "listDelimiter": ",",
          "columns": ["unqid", "timestamp", "app_id", "geo_country", "dimension_count"]
        }
      },
      "metricsSpec": [
        {
          "type": "count",
          "name": "count"
        },
        {
          "type": "doubleSum",
          "name": "dimension_count",
          "fieldName": "dimension_count"
        }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2015-05-01/2015-06-01"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "/home/ubuntu/sandbox/",
        "filter": "shaman_events_data.csv"
      }
    },
    "tuningConfig": {
      "type": "index",
      "targetPartitionSize": 0,
      "rowFlushBoundary": 0
    }
  }
}
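
For context, the CSV rows follow the columns list in the parseSpec, with epoch seconds in the timestamp field; a made-up example row:

    u-0001,1430870400,app42,US,3.0

(1430870400 is 2015-05-06T00:00:00Z; all field values here are hypothetical.)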

My guess is that it is a timezone issue: the 12 rows still there live in a segment that, because of the shift, falls outside the time range covered by the new data, so it was never overwritten.

Are you

1) Running your jobs on machines set up with a timezone of UTC?
2) Running your processes with "-Duser.timezone=UTC"?

Also, can you check your coordinator console and provide the intervals
for which segments exist?
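
To make the failure mode concrete: a posix timestamp is epoch seconds and is always interpreted as UTC, so if the epoch values were derived from local wall-clock times, they shift by the zone offset and rows near midnight land in a neighboring DAY segment. A minimal sketch in MySQL (just one common source of such values; the times here are hypothetical):

    -- UNIX_TIMESTAMP() interprets its argument in the session time zone.
    SET time_zone = '+00:00';
    SELECT UNIX_TIMESTAMP('2015-05-05 20:00:00');
    -- 1430856000 = 2015-05-05T20:00:00Z -> May 5 segment

    SET time_zone = '-07:00';  -- e.g. US Pacific in May
    SELECT UNIX_TIMESTAMP('2015-05-05 20:00:00');
    -- 1430881200 = 2015-05-06T03:00:00Z -> May 6 segment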

--Eric

Hi Eric,

We're running on AWS with UTC time, and with the config -Duser.timezone=UTC.

Here are the segment intervals as per the Coordinator old-console. Every entry is served from localhost:4203 for datasource events3, with dimensions app_id,geo_country,unqid, metrics count,dimension_count, and a constant "9" column; segment IDs follow the pattern events3_<start>_<end>_<version>.

Interval                                            Version                     Size
2015-05-05T00:00:00.000Z/2015-05-06T00:00:00.000Z   2015-06-01T19:35:09.678Z    2634
2015-05-06T00:00:00.000Z/2015-05-07T00:00:00.000Z   2015-06-01T21:24:33.771Z    3643
2015-05-07T00:00:00.000Z/2015-05-08T00:00:00.000Z   2015-06-01T21:24:33.771Z    4638
2015-05-08T00:00:00.000Z/2015-05-09T00:00:00.000Z   2015-06-01T21:24:33.771Z    4065
2015-05-09T00:00:00.000Z/2015-05-10T00:00:00.000Z   2015-06-01T21:24:33.771Z    4467
2015-05-10T00:00:00.000Z/2015-05-11T00:00:00.000Z   2015-06-01T21:24:33.771Z    4414
2015-05-11T00:00:00.000Z/2015-05-12T00:00:00.000Z   2015-06-01T21:24:33.771Z    6234
2015-05-12T00:00:00.000Z/2015-05-13T00:00:00.000Z   2015-06-01T21:24:33.771Z    6267
2015-05-13T00:00:00.000Z/2015-05-14T00:00:00.000Z   2015-06-01T21:24:33.771Z    7197
2015-05-14T00:00:00.000Z/2015-05-15T00:00:00.000Z   2015-06-01T21:24:33.771Z    9366
2015-05-15T00:00:00.000Z/2015-05-16T00:00:00.000Z   2015-06-01T21:24:33.771Z    9872
2015-05-16T00:00:00.000Z/2015-05-17T00:00:00.000Z   2015-06-01T21:24:33.771Z    5222

Hi Eric,

I think you are right. (I notice the 2015-05-05 segment still has the older version timestamp from my first run.)

The Unix timestamps on my end are being stored as PST.

Let me try to re-run.

Yes, everything is fine now. What happened was that I was writing the data out of MySQL, and FROM_UNIXTIME() converts an epoch value into a local datetime in the session time zone instead of preserving the original UTC.
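
For anyone who hits the same thing, a minimal illustration of that behavior (values hypothetical; the fix is to pin the session time zone to UTC before converting):

    SET time_zone = '-07:00';          -- e.g. a Pacific-time session
    SELECT FROM_UNIXTIME(1430870400);  -- 2015-05-05 17:00:00, shifted to local time

    SET time_zone = '+00:00';          -- session pinned to UTC
    SELECT FROM_UNIXTIME(1430870400);  -- 2015-05-06 00:00:00, the original UTC instant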
Thanks for the hint :slight_smile: