Druid indexing job messing up timestamps

Hi everybody,

I am currently running some initial tests with Druid within a Hortonworks HDP 2.6.4 distribution (so Druid version 0.10.1) and am encountering some strange behaviour.

I have data in a text file (CSV) in HDFS, with read access for the Druid technical user. It has one timestamp column in the format YYYY-MM-DD HH:mm:ss.s, and the data is more or less evenly distributed over about 5 years: some hundred million rows, with the file somewhere between 10 and 30 GB.

When running a static indexing job, all of the time series data gets mapped onto time intervals that do not match the source data; e.g., in some configurations it appears that every timestamp gets mapped onto January of the corresponding year. This also happens if I only ingest a smaller time interval. My ingestion specification JSON:

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "/path/to/my/textfile.csv"
      }
    },
    "dataSchema": {
      "dataSource": "datasource_name",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "month",
        "queryGranularity": "none",
        "intervals": [
          "2013-12-01/2018-09-01"
        ]
      },
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "csv",
          "timestampSpec": {
            "format": "YYYY-MM-DD HH:mm:ss.s",
            "column": "timestamp"
          },
          "columns": [
            "column1",
            ...,
            "columnN"
          ],
          "dimensionsSpec": {
            "dimensions": [
              "dimension1",
              ...,
              "dimensionM"
            ]
          }
        }
      },
      "metricsSpec": [
        {
          "name": "metric1",
          "type": "count"
        },
        ...
      ]
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 100000000
      },
      "jobProperties": {
        "mapreduce.map.memory.mb": "8192",
        "mapreduce.reduce.memory.mb": "32768"
      }
    }
  }
}

After successfully loading the data into Druid via an HTTP POST, the Druid console shows shards only for the January of each corresponding year. After registering the Druid datasource in the Hive LLAP CLI, a SELECT DISTINCT on the __time column confirms that all timestamps appear to have been mapped onto dates in January.
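For reference, the Hive side looked roughly like this (a sketch; the external table name is simply chosen to match the datasource):

  -- Register the Druid datasource as an external table (HDP Hive/Druid integration)
  CREATE EXTERNAL TABLE datasource_name
  STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
  TBLPROPERTIES ("druid.datasource" = "datasource_name");

  -- Every distinct value comes back as a date in January
  SELECT DISTINCT `__time` FROM datasource_name;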

Am I missing something, e.g. in defining partition and shard sizes?


[Screenshot: druid_console_screenshot_shards.png — Druid console shard view]

Can you try ingesting that input file, but with the interval adjusted so that the January data is excluded, and see whether any rows are ingested? Maybe the non-January rows are getting discarded (timestamp parsing issues, perhaps?).
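For example, something like this in the granularitySpec (a sketch based on your spec above; any window without January data would do):

  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "month",
    "queryGranularity": "none",
    "intervals": [
      "2013-11-30/2013-12-31"
    ]
  }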

Thanks,

Jon

Hi Jon,

Thanks so much for your help.

Indeed, when trying to ingest data within a time window containing no January data, the indexing job fails:

[…]
2018-10-01T15:19:42,321 INFO [task-runner-0-priority-0] io.druid.indexer.DetermineHashedPartitionsJob - Job completed, loading up partitions for intervals[Optional.of([2013-11-30T00:00:00.000Z/2013-12-31T00:00:00.000Z])].
2018-10-01T15:19:42,366 INFO [task-runner-0-priority-0] io.druid.indexer.DetermineHashedPartitionsJob - Found approximately [0] rows in data.
2018-10-01T15:19:42,367 INFO [task-runner-0-priority-0] io.druid.indexer.DetermineHashedPartitionsJob - Creating [0] shards
[…]
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: No buckets?? seems there is no data to index.
[…]

A line of the CSV file in HDFS looks like this:

2015-05-08 19:13:52.0,[String],[String],[String],[String],1,0,0,0,0,0,104,5,113,5

and I checked that the date format is YYYY-MM-DD HH:mm:ss.s.

When I ingest data from January, it looks like this:

2018-10-01T15:30:52,276 INFO [task-runner-0-priority-0] io.druid.indexer.DetermineHashedPartitionsJob - Job completed, loading up partitions for intervals[Optional.of([2014-01-01T00:00:00.000Z/2014-01-02T00:00:00.000Z])].
2018-10-01T15:30:52,321 INFO [task-runner-0-priority-0] io.druid.indexer.DetermineHashedPartitionsJob - Found approximately [71,235] rows in data.
2018-10-01T15:30:52,322 INFO [task-runner-0-priority-0] io.druid.indexer.DetermineHashedPartitionsJob - Creating [1] shards

and the job finishes with SUCCESS.

Any idea why Druid would not accept dates that are not in January?

Hi Jonas,

Your timestamp format should be: yyyy-MM-dd HH:mm:ss.S

See the Joda-Time DateTimeFormat javadoc (https://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html) for reference on the formatting. Capital “D” is “day of year”, so it would only work properly for January (where day of month and day of year coincide).
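You can see the difference by parsing one of your sample rows with both patterns in Joda-Time (a minimal sketch; the class name is just for illustration):

  import org.joda.time.format.DateTimeFormat;
  import org.joda.time.format.DateTimeFormatter;

  public class TimestampFormatCheck {
      public static void main(String[] args) {
          String row = "2015-05-08 19:13:52.0";

          // Broken pattern: capital DD parses day-of-year, which overrides
          // the already-parsed month, so every row collapses into January.
          DateTimeFormatter broken = DateTimeFormat.forPattern("YYYY-MM-DD HH:mm:ss.s");
          System.out.println(broken.parseDateTime(row)); // a January date, e.g. 2015-01-08T19:13:...

          // Correct pattern: lowercase dd is day-of-month, capital S is
          // fraction-of-second, so the row parses as intended.
          DateTimeFormatter correct = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss.S");
          System.out.println(correct.parseDateTime(row)); // 2015-05-08T19:13:52.0...
      }
  }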

Hi Gian,

That is a page I should apparently bookmark. With this knowledge the error is obvious, and the behaviour of the ingestion job is indeed correct.

Thank you so much for this hint!