0.10.1 S3A Batch Ingestion Issues

I’m attempting to run batch ingestion with a base 0.10.1 installation. The only extensions loaded are “druid-s3-extensions” and “postgresql-metadata-storage”.

When I create an indexing task with an “s3a://<file_url>” inputSpec path and jobProperties like this:

```
"jobProperties": {
  "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
  "fs.s3a.server-side-encryption-algorithm": "AES256",
  "fs.s3a.connection.ssl.enabled": "true"
}
```

it throws the following error:

```
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101) ~[?:?]
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193) ~[?:?]
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654) ~[?:?]
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667) ~[?:?]
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94) ~[?:?]
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703) ~[?:?]
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685) ~[?:?]
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373) ~[?:?]
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295) ~[?:?]
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:500) ~[?:?]
        at org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:110) ~[?:?]
        at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301) ~[?:?]
        at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318) ~[?:?]
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196) ~[?:?]
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290) ~[?:?]
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287) ~[?:?]
        at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_141]
        at javax.security.auth.Subject.doAs(Subject.java:422) ~[?:1.8.0_141]
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698) ~[?:?]
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287) ~[?:?]
        at io.druid.indexer.DetermineHashedPartitionsJob.run(DetermineHashedPartitionsJob.java:117) ~[druid-indexing-hadoop-0.10.1.jar:0.10.1]
        at io.druid.indexer.JobHelper.runJobs(JobHelper.java:372) ~[druid-indexing-hadoop-0.10.1.jar:0.10.1]
        at io.druid.indexer.HadoopDruidDetermineConfigurationJob.run(HadoopDruidDetermineConfigurationJob.java:91) ~[druid-indexing-hadoop-0.10.1.jar:0.10.1]
        at io.druid.indexing.common.task.HadoopIndexTask$HadoopDetermineConfigInnerProcessing.runTask(HadoopIndexTask.java:307) ~[druid-indexing-service-0.10.1.jar:0.10.1]

```
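
Worth noting: org.apache.hadoop.fs.s3a.S3AFileSystem ships in the separate hadoop-aws artifact, not in hadoop-client, so the fs.s3a.impl jobProperty alone can’t satisfy it; the jar also has to be loadable by the task. A minimal sketch of the task-level fields involved (assuming the Hadoop 2.7.3 client, with the rest of the spec elided):

```
{
  "type": "index_hadoop",
  "hadoopDependencyCoordinates": [
    "org.apache.hadoop:hadoop-client:2.7.3",
    "org.apache.hadoop:hadoop-aws:2.7.3"
  ],
  "spec": { "...": "..." }
}
```

The matching jars have to exist under the hadoop-dependencies directory for those coordinates to resolve, as the later posts in this thread show.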

I also tried installing hadoop-aws:2.7.3, and that threw the following error on a batch load job:

```
Caused by: java.lang.NoSuchMethodError: com.amazonaws.AmazonWebServiceRequest.copyPrivateRequestParameters()Ljava/util/Map;
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3506) ~[?:?]
        at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031) ~[?:?]
        at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994) ~[?:?]
        at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297) ~[?:?]
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669) ~[?:?]
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94) ~[?:?]
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703) ~[?:?]
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685) ~[?:?]
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373) ~[?:?]
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295) ~[?:?]
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:500) ~[?:?]
        at org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:110) ~[?:?]

```

I’m having the same problem using S3A with 0.10.1. Let me know if you figure out how to make it work!

Can you share the job spec?

Is this on EMR?

Here’s my index job:

```
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "s3a://path/to/data/"
      }
    },
    "dataSchema": {
      "dataSource": "sf",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "updateTime",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              { "type": "long", "name": "userId" },
              "city",
              "state",
              "zipCode",
              "countryCode",
              { "type": "long", "name": "component" }
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [
        { "type": "thetaSketch", "name": "userId_sketch", "fieldName": "userId" },
        { "type": "thetaSketch", "name": "component_sketch", "fieldName": "component" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "WEEK",
        "queryGranularity": "none",
        "intervals": ["2017-04-01/2017-07-23"]
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "targetPartitionSize": 5000000
      },
      "jobProperties": {
        "mapreduce.job.user.classpath.first": true,
        "fs.s3.awsAccessKeyId": "xxx",
        "fs.s3.awsSecretAccessKey": "xxx",
        "fs.s3.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "fs.s3n.awsAccessKeyId": "xxx",
        "fs.s3n.awsSecretAccessKey": "xxx",
        "fs.s3n.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "fs.s3a.awsAccessKeyId": "xxx",
        "fs.s3a.awsSecretAccessKey": "xxx",
        "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
      },
      "leaveIntermediate": false
    }
  }
}
```

This is on EMR. Is it still the case that s3a won’t work at all with EMR?

Also, now when I try doing s3n or s3, I get a “java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found” error, which wasn’t happening before I upgraded to 0.10.1. Did I maybe miss loading something in? I have druid-s3-extensions in my load list.

Thanks!

Ryan

My job spec was:

```
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "",
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "csv",
          "columns": [],
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": []
          }
        }
      },
      "metricsSpec": [],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "YEAR",
        "queryGranularity": {
          "type": "none"
        },
        "rollup": false,
        "intervals": [
          "2012-07-06T19:23:37.000Z/2017-06-16T20:38:36.000Z"
        ]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "s3a:///.csv.gz"
      },
      "metadataUpdateSpec": null,
      "segmentOutputPath": null
    },
    "tuningConfig": {
      "type": "hadoop",
      "workingPath": null,
      "version": "2017-08-24T21:48:23.199Z",
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 500000,
        "maxPartitionSize": 750000,
        "assumeGrouped": false,
        "numShards": -1,
        "partitionDimensions": []
      },
      "shardSpecs": {},
      "indexSpec": {
        "bitmap": {
          "type": "concise"
        },
        "dimensionCompression": "lz4",
        "metricCompression": "lz4",
        "longEncoding": "longs"
      },
      "maxRowsInMemory": 50000,
      "leaveIntermediate": false,
      "cleanupOnFailure": true,
      "overwriteFiles": true,
      "ignoreInvalidRows": false,
      "jobProperties": {
        "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "fs.s3a.server-side-encryption-algorithm": "AES256",
        "fs.s3a.connection.ssl.enabled": "true",
        "mapreduce.job.classloader": "true",
        "mapreduce.job.user.classpath.first": "true",
        "mapreduce.task.timeout": "1800000",
        "mapreduce.job.maps": "1",
        "mapreduce.job.reduces": "1",
        "mapreduce.map.memory.mb": "256",
        "mapreduce.map.java.opts": "-server -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8",
        "mapreduce.reduce.memory.mb": "256",
        "mapreduce.reduce.java.opts": "-server -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8",
        "mapreduce.map.output.compress": "true",
        "mapred.map.output.compress.codec": "org.apache.hadoop.io.compress.SnappyCodec"
      },
      "combineText": false,
      "useCombiner": false,
      "buildV9Directly": true,
      "numBackgroundPersistThreads": 0,
      "forceExtendableShardSpecs": false,
      "useExplicitVersion": false,
      "allowedHadoopPrefix": []
    }
  },
  "dataSource": "",
  "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.7.3", "org.apache.hadoop:hadoop-aws:2.7.3"]
}
```

This is not running on EMR. Just an Amazon Linux EC2 machine.

Robert, were you able to get past this issue?

@Johnson. I have not been able to get past this. I spent a whole day on it and got nowhere. I was able to get s3n to work again, though, by explicitly including “hadoop-aws:2.7.3” in my hadoopDependencies with a base 0.10.1 installation.

Thanks, Robert - I ended up getting this error when I reverted to that Hadoop dependency in my ingestion spec:

```
2017-09-05T21:36:16,114 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[HadoopIndexTask{id=index_ip_queries-201, type=index_hadoop, dataSource=ip_queries}]
io.druid.java.util.common.ISE: Hadoop dependency [/opt/druid/druid-0.10.1-rc3/hadoop-dependencies/hadoop-aws/2.7.3] didn't exist!?
        at io.druid.initialization.Initialization.getHadoopDependencyFilesToLoad(Initialization.java:274) ~[druid-server-0.10.1-rc3.jar:0.10.1-rc3]
        at io.druid.indexing.common.task.HadoopTask.buildClassLoader(HadoopTask.java:156) ~[druid-indexing-service-0.10.1-rc3.jar:0.10.1-rc3]
        at io.druid.indexing.common.task.HadoopTask.buildClassLoader(HadoopTask.java:130) ~[druid-indexing-service-0.10.1-rc3.jar:0.10.1-rc3]
```

Have you seen this one?

Sorry, I should have been more clear.

My hadoopDependencyCoordinates are:

```
"hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.7.3", "org.apache.hadoop:hadoop-aws:2.7.3"]
```

I managed to fix this issue by adding the hadoop-aws jar under hadoop-dependencies. I was getting the following error during Hadoop batch ingestion:

java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

I fixed it by downloading hadoop-aws-2.7.3.jar (https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.3) and placing it at hadoop-dependencies/hadoop-client/2.7.3/hadoop-aws-2.7.3.jar.
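
If it helps to see why that placement works: the jar ends up inside the directory that the stock hadoop-client coordinate maps to, so (assuming the task is using Druid’s bundled Hadoop 2.7.3 client) nothing beyond the usual client entry is needed in the spec. A sketch of the relevant fragment:

```
"hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.7.3"]
```

Listing hadoop-aws as its own coordinate (as in the earlier posts) works too, but then a matching hadoop-dependencies/hadoop-aws/2.7.3 directory has to exist, which is what the “didn't exist!?” error above was complaining about.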

@Lawrence. Were you able to make it work with S3A? What region is your bucket in? I was only able to make it work with S3N in v2 regions. In v4 regions I can’t make it work with either S3A or S3N.

Is anyone using HDFS deep storage to push segments to S3 using s3a? What I noticed is that if I use hdfs, by default the config converter adds the Hadoop path from the config, so “segmentOutputPath” becomes “hdfs://:9000”. My storage config is:

druid.storage.type=hdfs
druid.storage.bucket=druid-data
druid.storage.baseKey=druid_
druid.storage.useS3aSchema=True
druid.storage.storageDirectory=s3a://druid-data

~ Biswajit

bump …

Anyone have a suggestion for S3A with Hadoop 2.7.3?

For me it works with s3n.

Hope this helps …

See my comment here for using s3a deep storage:
https://groups.google.com/d/msg/druid-user/i3qK0u5BDGM/iyjShu8EAQAJ