Java IOException while ingesting Parquet files

Hi all,

I have a single-node EC2 instance in AWS, and I am getting the error below when I try to ingest Parquet files into the Druid service of my Imply on-prem installation on that EC2 machine.

```
2019-04-09T10:56:50,093 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.common.task.HadoopIndexTask - Got invocation target exception in run(), cause:
java.lang.RuntimeException: java.io.IOException: No FileSystem for scheme: s3
2019-04-09T10:56:50,105 INFO [task-runner-0-priority-0] org.apache.druid.indexing.overlord.TaskRunnerUtils - Task [index_hadoop_lx_activities_catalog_clean_2019-04-09T10:56:45.078Z] status changed to [FAILED].
2019-04-09T10:56:50,106 INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_hadoop_lx_activities_catalog_clean_2019-04-09T10:56:45.078Z",
  "status" : "FAILED",
  "duration" : 1885,
  "errorMsg" : "java.lang.RuntimeException: java.io.IOException: No FileSystem for scheme: s3\n\tat com.google.common…"
}
```

Here is what my index.json file looks like:

```json
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "lx_activities_catalog_clean",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "format": "iso",
            "column": "etl_load_datetm"
          },
          "dimensionsSpec": {
            "dimensions": [
              "supplier_name",
              "supplier_id",
              "supplier_branch_name",
              "supplier_branch_id",
              "activity_internal_name",
              "activity_root_id",
              "activity_rank",
              "supplier_override_name",
              "offer_internal_name",
              "offer_root_id",
              "geo_id",
              "region_name",
              "internal_category_name",
              "catalog_name",
              "booking_cut_off_hours",
              "refund_blackout_period",
              "version",
              "is_qr_pass_eligible",
              "is_hotel_pickup_enabled",
              "third_party_supply"
            ]
          }
        }
      },
      "metricsSpec": [],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "day",
        "queryGranularity": "none",
        "intervals": [
          "2019-04-08/2019-04-09"
        ],
        "rollup": false
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "s3://…/foldername/"
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "targetPartitionSize": 5000000
      },
      "jobProperties": {
        "mapreduce.job.user.classpath.first": "true",
        "fs.s3n.awsAccessKeyId": "",
        "fs.s3n.awsSecretAccessKey": "<secretkey>",
        "fs.s3.awsAccessKeyId": "",
        "fs.s3.awsSecretAccessKey": "<secretkey>",
        "io.compression.codecs": "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec",
        "mapreduce.map.java.opts": "-server -Xmx1536m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps",
        "mapreduce.reduce.java.opts": "-server -Xmx2560m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
      }
    }
  }
}
```

I have included druid-avro-extensions, druid-s3-extensions, and druid-parquet-extensions in druid.extensions.loadList in my common.runtime.properties file.
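For reference, that line in common.runtime.properties looks roughly like this (other properties omitted):

```properties
# common.runtime.properties (excerpt): only the extensions load list is shown
druid.extensions.loadList=["druid-avro-extensions", "druid-s3-extensions", "druid-parquet-extensions"]
```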

**Can anyone please suggest what I might be missing here?**


Thanks,

Anoosha

Hi,

Bumping this question for help.

Thanks,

Anoosha

Hey Anoosha,

Does it work if you try s3n instead of s3? (Check the instructions on https://docs.imply.io/cloud/manage-data/emr).
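If you do switch to s3n, the pieces of the spec that change would look roughly like this. The bucket path and keys below are placeholders, and the explicit fs.s3n.impl setting may or may not be needed depending on your Hadoop version; treat this as a sketch rather than a drop-in spec:

```json
{
  "ioConfig": {
    "type": "hadoop",
    "inputSpec": {
      "type": "static",
      "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
      "paths": "s3n://…/foldername/"
    }
  },
  "tuningConfig": {
    "type": "hadoop",
    "jobProperties": {
      "fs.s3n.awsAccessKeyId": "<access key>",
      "fs.s3n.awsSecretAccessKey": "<secret key>",
      "fs.s3n.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem"
    }
  }
}
```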

You might also need to make sure to include the hadoop-aws jar, if it’s not already in your Druid installation’s hadoop-dependencies directory.
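For example, if you place the hadoop-aws jar under hadoop-dependencies/hadoop-client/&lt;version&gt;/ alongside the other client jars, you can make sure the task picks up that bundle by setting hadoopDependencyCoordinates at the top level of the task JSON. The 2.8.3 below is only illustrative; use whichever version directory your installation actually ships:

```json
{
  "type": "index_hadoop",
  "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.8.3"],
  "spec": { ... }
}
```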

Gian