Parquet Data Ingestion in Druid Error in Timestamp parsing using Joda

Hi,

I am getting an error while parsing datetime from a parquet data.

Posted the issue in stackoverflow

2018-01-18T19:31:52,509 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - Task Id : attempt_1516108443547_0022_m_000068_0, Status : FAILED
Error: io.druid.java.util.common.RE: Failure on row[{"t": "2017-09-01 21:14:11:552 IST"}]
    at io.druid.indexer.HadoopDruidIndexerMapper.map(HadoopDruidIndexerMapper.java:91)
    at io.druid.indexer.DetermineHashedPartitionsJob$DetermineCardinalityMapper.run(DetermineHashedPartitionsJob.java:288)
    ..

Caused by: java.lang.IllegalArgumentException: Invalid format: "2017-09-01 21:14:11:552 IST" is malformed at "IST"
    at org.joda.time.format.DateTimeFormatter.parseDateTime(DateTimeFormatter.java:945)
    at io.druid.java.util.common.parsers.TimestampParser.lambda$createTimestampParser$4(TimestampParser.java:93)
    at io.druid.java.util.common.parsers.TimestampParser.lambda$createObjectTimestampParser$8(TimestampParser.java:129)
    . .

My input json is:

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "s3://s3_path"
      }
    },
    "dataSchema": {
      "dataSource": "parquet_test1",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "ALL",
        "intervals": ["2017-08-01T00:00:00:000Z/2017-08-02T00:00:00:000Z"]
      },
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "t",
            "format": "yyyy-MM-dd HH:mm:ss:SSS zzz"
          },
          "dimensionsSpec": {
            "dimensions": [
              "dim1","dim2","dim3"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [{
        "type": "count",
        "name": "count"
      },{
          "type" : "count",
          "name" : "pid",
          "fieldName" : "pid"
        }]
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "targetPartitionSize": 5000000
      },
      "jobProperties" : {
        "mapreduce.job.user.classpath.first": "true",
        "fs.s3.awsAccessKeyId" : "KEYID",
        "fs.s3.awsSecretAccessKey" : "AccessKey",
        "fs.s3.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "fs.s3n.awsAccessKeyId" : "KEYID",
        "fs.s3n.awsSecretAccessKey" : "AccessKey",
        "fs.s3n.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "io.compression.codecs" : "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec"
      },
      "leaveIntermediate": true
    }
  }, "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.7.3", "org.apache.hadoop:hadoop-aws:2.7.3", "com.hadoop.gplcompression:hadoop-lzo:0.4.20"]
}

Possible solution

 1. How to parse "2017-09-01 21:14:11:552 IST" in joda format

 2. Any config to use SimpleDateFormat for parsing date in timestampSpec, as joda library is used default.

Please check and let me know any solution.

The Joda docs state that the "zzz" format cannot be parsed:

http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html

Zone names: Time zone names ('z') cannot be parsed.

Looking at the stackoverflow link you provided, it appears that it's
possible to provide a map of timezone names -> timezone IDs to resolve the
name ambiguity, but this is not supported by Druid.

Thanks,
Jon