Druid Parquet Errors

Hey folks,

Has anyone run into issues loading Parquet data from HDFS in Druid 0.10.1?

My indexing tasks are failing, and I'm seeing the following in the MapReduce logs:

```
ERROR [main] org.apache.hadoop.mapred.YarnChild - Error running child : java.lang.NoSuchMethodError: org.apache.avro.Schema$Field.<init>(Ljava/lang/String;Lorg/apache/avro/Schema;Ljava/lang/String;Ljava/lang/Object;)V
        at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:248)
        at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:231)
        at org.apache.parquet.avro.DruidParquetReadSupport.prepareForRead(DruidParquetReadSupport.java:91)
        at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:175)
        at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:190)
        at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:147)
        at org.apache.hadoop.mapreduce.lib.input.DelegatingRecordReader.initialize(DelegatingRecordReader.java:84)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:548)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:786)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
```

I've noticed that different extensions bundle different versions of Avro, and I tried pinning them all to a single version (1.7.7), without success.
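In case it helps anyone debugging the same thing, this is roughly how I check which Avro jars each part of the install is pulling in (a sketch assuming the standard layout with `extensions/`, `hadoop-dependencies/`, and `lib/` under the install root; adjust `DRUID_HOME` for your setup):

```shell
# List every Avro jar bundled under the Druid install, to spot version
# clashes (e.g. an avro-1.7.x jar sitting next to an avro-1.8.x one).
# DRUID_HOME below is an assumption -- point it at your actual install.
DRUID_HOME="${DRUID_HOME:-/opt/druid}"
find "$DRUID_HOME/extensions" "$DRUID_HOME/hadoop-dependencies" "$DRUID_HOME/lib" \
    -name 'avro*.jar' 2>/dev/null | sort
```

If this prints more than one Avro version, whichever jar wins on the task classpath determines which `Schema$Field` constructor the mapper actually sees.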

Ingestion spec:

```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "/topics/prod/forecast/year=2017/month=10/day=2/hour=22/"
      }
    },
    "dataSchema": {
      "dataSource": "m.fc",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "HOUR",
        "intervals": [
          "2017-10-02T22:00:00.000Z/2017-10-02T23:00:00.000Z"
        ]
      },
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {
            "format": "auto",
            "column": "timestamp"
          },
          "flattenSpec": {
            "useFieldDiscovery": true,
            "fields": [
              {
                "type": "root",
                "name": "timestamp"
              }
            ]
          },
          "dimensionsSpec": {
            "dimensions": [],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        },
        "fromPigAvroStorage": true
      },
      "metricsSpec": [
        {
          "type": "count",
          "name": "count"
        }
      ]
    },
    "tuningConfig": {
      "type": "hadoop",
      "useCombiner": "true",
      "buildV9Directly": "true",
      "partitionsSpec": {
        "type": "hashed",
        "numShards": 3
      },
      "jobProperties": {
        "mapreduce.job.user.classpath.first": "true"
      }
    }
  },
  "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.7.3"]
}
```
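One other thing I've been meaning to try: instead of `mapreduce.job.user.classpath.first`, Hadoop's job-level classloader isolation, which the Druid docs suggest for jar conflicts like this. A sketch of the `jobProperties` variant (the system-classes list is illustrative and may need tuning for your cluster):

```json
"jobProperties": {
  "mapreduce.job.classloader": "true",
  "mapreduce.job.classloader.system.classes": "-org.apache.avro.,java.,javax.,org.apache.hadoop."
}
```

The leading `-org.apache.avro.` excludes Avro from the system classes, so the job uses the Avro version shipped with the task rather than the one on the cluster.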

Thanks,

Dylan

Hi, I'm seeing a similar issue when trying to run Hadoop index tasks for Parquet data on Druid 0.10.1 (this previously worked on a pre-0.10 release). The error in my MR logs looks like:

```
ERROR [main] org.apache.hadoop.mapred.YarnChild - Error running child : java.lang.NoSuchFieldError: NULL_VALUE
        at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:245)
        at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:231)
        at org.apache.parquet.avro.DruidParquetReadSupport.prepareForRead(DruidParquetReadSupport.java:91)
        at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:175)
        at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:190)
        at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:147)
        at org.apache.hadoop.mapreduce.lib.input.DelegatingRecordReader.initialize(DelegatingRecordReader.java:84)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:557)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:795)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
```

Were you able to figure this out or get any help? I'm guessing it's a version dependency issue.

Apparently it was caused by Avro incompatibilities: Druid 0.10.1 ships Avro 1.7.x, while druid-parquet-extensions for 0.10.1 needs Avro 1.8.x.

Resolved it by installing druid-parquet-extensions 0.10.0, which also uses Avro 1.7.x.
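For anyone else landing here, this is roughly how I fetched the older extension with Druid's `pull-deps` tool (a sketch, not a recipe: the install path is assumed, and the contrib Maven coordinate below is my best guess for the 0.10.x line, so verify it against your repository before running):

```shell
# Fetch druid-parquet-extensions 0.10.0 into the extensions directory,
# then add it to druid.extensions.loadList and restart the services.
# The path and coordinate are assumptions -- adjust for your install.
cd /opt/druid
java -classpath "lib/*" io.druid.cli.Main tools pull-deps \
    --no-default-hadoop \
    -c io.druid.extensions.contrib:druid-parquet-extensions:0.10.0
```

Mixing extension versions across a release like this worked for me here only because 0.10.0's extension happens to match the Avro that Druid 0.10.1 ships; it isn't generally guaranteed to be compatible.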