Indexing on S3: Signature Version 4 support?

Hello,

I have a question about Druid on S3. I am trying to run a batch indexing job, but it fails to write data to my bucket with the following error message:
java.lang.Exception: org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: Request Error: Failed to automatically set required header "x-amz-content-sha256" for request with entity org.jets3t.service.impl.rest.httpclient.RepeatableRequestEntity@57635fb5

I think the problem might be related to the S3 signature version, because my S3 bucket only supports Signature Version 4.
Is there a setting in Druid to specify which S3 signature version to use?

For reference, I am using Druid 0.9.0, and here are the relevant S3 configuration settings of my setup:

  • in druid-0.9.0/conf/druid/_common/common.runtime.properties:
    druid.extensions.loadList=["druid-s3-extensions"]

    druid.storage.type=s3
    druid.storage.bucket=<MY_BUCKET>
    druid.storage.baseKey=druid/segments
    druid.s3.accessKey=<MY_ACCESS_KEY>
    druid.s3.secretKey=<MY_SECRET_KEY>

    druid.indexer.logs.type=s3
    druid.indexer.logs.s3Bucket=<MY_BUCKET>
    druid.indexer.logs.s3Prefix=druid/indexing-logs

  • in my index.json file, I added the following job properties:
    "jobProperties" : {
      "fs.s3.awsAccessKeyId" : "MY_ACCESS_KEY",
      "fs.s3.awsSecretAccessKey" : "MY_SECRET_KEY",
      "fs.s3.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
      "fs.s3n.awsAccessKeyId" : "MY_ACCESS_KEY",
      "fs.s3n.awsSecretAccessKey" : "MY_SECRET_KEY",
      "fs.s3n.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
      "io.compression.codecs" : "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec"
    }

Let me know if you need any other info.

Thanks in advance for the help!

Isabelle

I think we need to update our version of jets3t in Druid.

It looks like this was fixed in http://www.jets3t.org/RELEASE_NOTES.html

Do you mind submitting a PR that updates jets3t?

Hi Fangjin,

It looks like you are already on the latest jets3t version, 0.9.4, aren't you?

Thanks,

Isabelle

Hopefully an update to jets3t won't upset the Hadoop/Spark indexers too much.

One thing to note is that the indexing logs are uploaded to my S3 bucket without issue.
What is the difference between the log push to S3 and the index push to S3? Do they both use the jets3t library?

Thanks,

Isabelle

Hmm, there should be no difference as they are using the same version of jets3t.

+1 on the same issue: segment upload is fine, but the historical cannot load from S3.

I was able to resolve the issue; please see if this helps: https://groups.google.com/forum/#!topic/druid-user/i3qK0u5BDGM

V4 auth problems:

The underlying problems are https://issues.apache.org/jira/browse/HADOOP-9248 and https://issues.apache.org/jira/browse/HADOOP-13325

The confirmed workaround is:

  1. Clone Druid master, add case "s3a": at line 404 of JobHelper.java, change aws-java-sdk to version 1.7.4 in pom.xml, and rebuild.

  2. In common.runtime.properties, configure S3 deep storage as normal.

  3. Save a file in conf/druid/_common/jets3t.properties with the contents:

s3service.s3-endpoint = s3.ap-northeast-2.amazonaws.com
storage-service.request-signature-version = AWS4-HMAC-SHA256

  4. Run: java -cp "dist/druid/lib/*" -Ddruid.extensions.directory="dist/druid/extensions" -Ddruid.extensions.hadoopDependenciesDir="dist/druid/hadoop-dependencies" io.druid.cli.Main tools pull-deps --no-default-hadoop -h "org.apache.hadoop:hadoop-client:2.7.2" -h "org.apache.hadoop:hadoop-aws:2.7.2"

  5. In druid.indexer.runner.javaOpts on the middleManager, add -Dcom.amazonaws.services.s3.enableV4 (a standalone sketch of what this flag does follows this list).

  6. In the job json, "hadoopDependencyCoordinates" : ["org.apache.hadoop:hadoop-client:2.7.2", "org.apache.hadoop:hadoop-aws:2.7.2"]

  7. In the job json,
     "jobProperties" : {
       "fs.s3.impl" : "org.apache.hadoop.fs.s3a.S3AFileSystem",
       "fs.s3n.impl" : "org.apache.hadoop.fs.s3a.S3AFileSystem",
       "fs.s3a.endpoint" : "s3.ap-northeast-2.amazonaws.com",
       "fs.s3a.access.key" : "XXX",
       "fs.s3a.secret.key" : "YYY"
     }
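
To sanity-check V4 signing outside of Druid, below is a minimal standalone sketch (my own code, not from the Druid or Hadoop codebases) of what the -Dcom.amazonaws.services.s3.enableV4 switch in step 5 accomplishes, assuming the AWS SDK for Java 1.x is on the classpath; the credentials, bucket, and endpoint are placeholders:

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.S3ObjectSummary;

    public class V4SignatureCheck
    {
      public static void main(String[] args)
      {
        // Equivalent of -Dcom.amazonaws.services.s3.enableV4 on the command
        // line; set it before the first S3 client is constructed.
        System.setProperty("com.amazonaws.services.s3.enableV4", "true");

        AmazonS3Client s3 = new AmazonS3Client(
            new BasicAWSCredentials("MY_ACCESS_KEY", "MY_SECRET_KEY"));
        // Signature Version 4 needs the region-specific endpoint,
        // not the global s3.amazonaws.com one.
        s3.setEndpoint("s3.ap-northeast-2.amazonaws.com");

        // Any authenticated call will do; a listing fails fast if the
        // bucket rejects the signature version.
        for (S3ObjectSummary o : s3.listObjects("MY_BUCKET").getObjectSummaries()) {
          System.out.println(o.getKey());
        }
      }
    }

If this listing succeeds but the indexing task still fails, the problem is on the Hadoop/jets3t side rather than the SDK side.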

FYI to anyone reading: that list mixes together things needed to make Hadoop indexing work and things needed to make historical segment downloading work.

Any current information regarding the fix for this?

Druid 0.10.1 includes support for s3a, so I would suggest giving that a shot.
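
For anyone landing here later: on 0.10.1 the source patch from the workaround above should no longer be necessary. A minimal sketch of the job spec side, assuming the hadoop-aws jar is still pulled in as in step 4 above (the endpoint and keys are placeholders):

    "jobProperties" : {
      "fs.s3a.impl" : "org.apache.hadoop.fs.s3a.S3AFileSystem",
      "fs.s3a.endpoint" : "s3.ap-northeast-2.amazonaws.com",
      "fs.s3a.access.key" : "XXX",
      "fs.s3a.secret.key" : "YYY"
    }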