Support for ingesting a Parquet timestamp field stored as INT96

Hello,

I am trying to use druid-parquet-extensions to ingest Parquet data into Druid. As per my understanding, Parquet uses INT96 as the data type for timestamps:

optional int96 logged_at

Trying to read these Parquet files in a Hadoop index job throws an error that INT96 is not supported. It looks like the INT96 field may need to be read as a byte array and decoded manually. Has anyone seen a similar error, or any suggestions?
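From what I have read, the INT96 value is the Impala/Hive-style 12-byte timestamp: 8 little-endian bytes of nanoseconds within the day followed by 4 little-endian bytes of the Julian day number. Assuming that layout (I have not verified it against our files yet), a rough sketch of converting those bytes to epoch milliseconds could look like the following; Int96TimestampConverter is just an illustrative name, not an existing Druid class:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Int96TimestampConverter
{
  // Julian day number of the Unix epoch (1970-01-01); assumption based on the
  // Impala/Hive INT96 timestamp layout, not taken from Druid code.
  private static final long JULIAN_EPOCH_DAY = 2440588L;
  private static final long MILLIS_PER_DAY = 86400000L;
  private static final long NANOS_PER_MILLI = 1000000L;

  // Converts a 12-byte INT96 value (8 little-endian bytes of nanos-of-day
  // followed by 4 little-endian bytes of Julian day) to epoch milliseconds.
  public static long int96ToMillis(byte[] int96Bytes)
  {
    ByteBuffer buf = ByteBuffer.wrap(int96Bytes).order(ByteOrder.LITTLE_ENDIAN);
    long nanosOfDay = buf.getLong();
    long julianDay = Integer.toUnsignedLong(buf.getInt());
    return (julianDay - JULIAN_EPOCH_DAY) * MILLIS_PER_DAY + nanosOfDay / NANOS_PER_MILLI;
  }
}

If that is roughly right, the field could be exposed to Druid as a long and used in the timestampSpec, but I am not sure where in the extension such a conversion would best live.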

I am using Druid 0.11.0; here is the detailed stack trace:

Caused by: java.lang.IllegalArgumentException: INT96 not yet implemented.
    at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:279) ~[parquet-avro-1.8.2.jar:0.11.0.1]
    at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:264) ~[parquet-avro-1.8.2.jar:0.11.0.1]
    at org.apache.parquet.schema.PrimitiveType$PrimitiveTypeName$7.convert(PrimitiveType.java:223) ~[parquet-column-1.8.2.jar:1.8.2]
    at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:263) ~[parquet-avro-1.8.2.jar:0.11.0.1]
    at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:241) ~[parquet-avro-1.8.2.jar:0.11.0.1]
    at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:231) ~[parquet-avro-1.8.2.jar:0.11.0.1]
    at org.apache.parquet.avro.DruidParquetReadSupport.prepareForRead(DruidParquetReadSupport.java:98) ~[druid-parquet-extensions-0.11.0.1.jar:0.11.0.1]
    at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:175) ~[parquet-hadoop-1.8.2.jar:1.8.2]
    at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:190) ~[parquet-hadoop-1.8.2.jar:1.8.2]
    at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:147) ~[parquet-hadoop-1.8.2.jar:1.8.2]
    at org.apache.hadoop.mapreduce.lib.input.DelegatingRecordReader.initialize(DelegatingRecordReader.java:84) ~[hadoop-mapreduce-client-core-2.7.3.2.5.3.0-37.jar:?]
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:548) ~[hadoop-mapreduce-client-core-2.7.3.2.5.3.0-37.jar:?]
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:786) ~[hadoop-mapreduce-client-core-2.7.3.2.5.3.0-37.jar:?]
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341) ~[hadoop-mapreduce-client-core-2.7.3.2.5.3.0-37.jar:?]
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243) ~[hadoop-mapreduce-client-common-2.7.3.2.5.3.0-37.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_141]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_141]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_141]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_141]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_141]

I have created an issue in the Druid repository: https://github.com/druid-io/druid/issues/5150

I would be willing to look into fixing it; it would be nice if someone who has worked on this code could share thoughts on how best to approach it. (Also, is there any workaround in the meantime?)