Batch Ingestion of Parquet using EMR

Hi,

I have added the Parquet extension to the Druid POC and am trying to ingest some files (from S3). I can see that the extension is loaded:

2018-07-26T09:59:48,582 INFO [main] io.druid.guice.JsonConfigurator - Loaded class[class io.druid.guice.ExtensionsConfig] from props[druid.extensions.] as [ExtensionsConfig{searchCurrentClassloader=true, directory='dist/druid/extensions', useExtensionClassloaderFirst=false, hadoopDependenciesDir='dist/druid/hadoop-dependencies', hadoopContainerDruidClasspath='null', addExtensionsToHadoopContainer=false, loadList=[druid-histogram, druid-datasketches, druid-kafka-indexing-service, druid-kinesis-indexing-service, druid-parser-route, druid-s3-extensions, druid-hdfs-storage, druid-parquet-extensions, druid-avro-extensions]}]
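As a sanity check that the extension directories named in the loadList exist on disk (paths assumed from the log line above; adjust to your install):

    ls dist/druid/extensions/druid-parquet-extensions
    ls dist/druid/extensions/druid-avro-extensions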

But when I try to run my EMR job using emr-5.5.0 with Hadoop 2.7.3, I get the following (from the EMR logs):

2018-07-26T10:07:49,430 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - Task Id : attempt_1532592899049_0006_m_000000_0, Status : FAILED
Error: java.lang.ClassNotFoundException: io.druid.data.input.avro.GenericRecordAsMap
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at io.druid.data.input.parquet.ParquetHadoopInputRowParser.parse(ParquetHadoopInputRowParser.java:62)
	at io.druid.data.input.parquet.ParquetHadoopInputRowParser.parse(ParquetHadoopInputRowParser.java:37)
	at io.druid.data.input.impl.InputRowParser.parseBatch(InputRowParser.java:49)
	at io.druid.segment.transform.TransformingInputRowParser.parseBatch(TransformingInputRowParser.java:50)
	at io.druid.indexer.HadoopDruidIndexerMapper.parseInputRow(HadoopDruidIndexerMapper.java:110)
	at io.druid.indexer.HadoopDruidIndexerMapper.map(HadoopDruidIndexerMapper.java:68)
	at io.druid.indexer.DetermineHashedPartitionsJob$DetermineCardinalityMapper.run(DetermineHashedPartitionsJob.java:283)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

Am I missing something?

Thank you,

Alon

Hi Alon,

Please check whether the versions of your extensions are all aligned. At one point (apologies, I don't remember which release off the top of my head) the GenericRecordAsMap class was moved from the Avro extension to the Parquet extension. It looks like you may have a newer Avro extension that no longer contains this class, alongside an older Parquet extension that still expects to find it there.
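A quick way to confirm the mismatch is to check which extension jar (if any) actually contains the class, and then re-pull both extensions at one matching version. A sketch, assuming a Druid 0.12.x install laid out as in your log (the paths and the 0.12.1 version below are illustrative; adjust to your setup):

    # 1) Look for the missing class in the Avro and Parquet extension jars.
    for jar in dist/druid/extensions/druid-avro-extensions/*.jar \
               dist/druid/extensions/druid-parquet-extensions/*.jar; do
      unzip -l "$jar" | grep -q GenericRecordAsMap && echo "found in $jar"
    done

    # 2) If the jar versions differ, re-fetch both extensions at the same
    #    version with the bundled pull-deps tool (run from the Druid root).
    java -classpath "lib/*" io.druid.cli.Main tools pull-deps \
      --defaultVersion 0.12.1 \
      -c io.druid.extensions:druid-avro-extensions \
      -c io.druid.extensions:druid-parquet-extensions

If step 1 finds the class in neither jar, that would match the ClassNotFoundException you are seeing.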