Druid: Ingesting ORC Files

Hi all,

I am trying to load data from an ORC file into Druid. The data has a column of type timestamp, but I get this error after uploading the file:
Unparseable timestamp found!
This is the schema of the input file:

– local_dt: timestamp (nullable = true)
– site_name: string (nullable = true)
– hit_time_gmt: string (nullable = true)
– ip: string (nullable = true)
– date_time: timestamp (nullable = true)

I have attached the input ORC file as well as the indexing JSON. Any help on this is highly appreciated.

io.druid.java.util.common.parsers.ParseException: Unparseable timestamp found!
	at io.druid.data.input.impl.MapInputRowParser.parseBatch(MapInputRowParser.java:75) ~[druid-api-0.12.3.jar:0.12.3]
	at io.druid.data.input.impl.StringInputRowParser.parseMap(StringInputRowParser.java:165) ~[druid-api-0.12.3.jar:0.12.3]
	at io.druid.data.input.impl.StringInputRowParser.parse(StringInputRowParser.java:148) ~[druid-api-0.12.3.jar:0.12.3]
	at io.druid.segment.transform.TransformingStringInputRowParser.parse(TransformingStringInputRowParser.java:57) ~[druid-processing-0.12.3.jar:0.12.3]
	at io.druid.data.input.impl.FileIteratingFirehose.nextRow(FileIteratingFirehose.java:81) ~[druid-api-0.12.3.jar:0.12.3]
	at io.druid.indexing.common.task.IndexTask.generateAndPublishSegments(IndexTask.java:661) ~[druid-indexing-service-0.12.3.jar:0.12.3]
	at io.druid.indexing.common.task.IndexTask.run(IndexTask.java:264) ~[druid-indexing-service-0.12.3.jar:0.12.3]
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:444) [druid-indexing-service-0.12.3.jar:0.12.3]
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:416) [druid-indexing-service-0.12.3.jar:0.12.3]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_171]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_171]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_171]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
Caused by: java.lang.NullPointerException: Null timestamp in input: {site_name=ORC}
	at io.druid.data.input.impl.MapInputRowParser.parseBatch(MapInputRowParser.java:67) ~[druid-api-0.12.3.jar:0.12.3]
	... 12 more

2018-12-06T11:01:23,377 INFO [task-runner-0-priority-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_omni_hit_small_orc_2018-12-06T11:01:18.066Z] status changed to [FAILED].
2018-12-06T11:01:23,380 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_omni_hit_small_orc_2018-12-06T11:01:18.066Z",
  "status" : "FAILED",
  "duration" : 243
}

part-00000-2e4d8468-c3b9-4064-855b-0c60ce6aa15f.snappy.orc (858 Bytes)

omni_orc_small_index.json (1.31 KB)

Hi Shivani,

The Druid ORC extension is currently a “contrib” extension (http://druid.io/docs/latest/development/extensions-contrib/orc.html), but I recently fixed a bug parsing “date” columns and was curious if there was a similar issue with “timestamp” columns, so I downloaded your file to have a look with my debugger.

Luckily, there doesn’t appear to be a bug. Note, however, that I think the ORC extension only supports the Hadoop indexing task (type ‘index_hadoop’), not the plain ‘index’ task you are using, so you will need to adjust your task spec as detailed in the extension docs linked above.
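To give a rough idea, here is a minimal sketch of what that wrapper might look like. The dataSource name, intervals, and input path are placeholders I made up, the parser is abridged here (fill in the full “orc” parser shown below), and the inputFormat class is what I believe the extension docs use, so please double-check everything against them:

{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "omni_hit_small_orc",
      "parser": { "type": "orc" },
      "metricsSpec": [],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2018-12-01/2018-12-07"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat",
        "paths": "/path/to/your/input.orc"
      }
    },
    "tuningConfig": { "type": "hadoop" }
  }
}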

I adjusted the parser config like so to be able to parse the row in your sample file:

"parser": {
  "type": "orc",
  "parseSpec": {
    "format": "timeAndDims",
    "timestampSpec": {
      "column": "local_dt",
      "format": "auto"
    },
    "dimensionsSpec": {
      "dimensions": [
        "local_dt",
        "site_name",
        "hit_time_gmt",
        "ip"
      ],
      "dimensionExclusions": [],
      "spatialDimensions": []
    }
  },
  "typeString": "struct<local_dt:timestamp,site_name:string,hit_time_gmt:string,ip:string,date_time:timestamp>"
},

The ORC extension is a bit funny in that the Apache Hive library it uses to parse the file relies on this ‘typeString’ schema definition to give the columns their names, and the order must match the order of the columns in the file. If ‘typeString’ is not set, it is generated automatically, with all types set to “string”, from the entries in the ‘dimensions’ list. I think this can often work correctly, but it’s definitely clearer what is going on when a ‘typeString’ is defined and filled out correctly.
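For example (this is my own illustration of the auto-generated value, not something I copied from a log), with the ‘dimensions’ list above and no ‘typeString’ set, the generated schema would come out roughly as:

"typeString": "struct<local_dt:string,site_name:string,hit_time_gmt:string,ip:string>"

which gives the columns names but treats every value as a string, so writing out the real types, as in the spec above, makes the behavior much easier to reason about.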

Good luck!