Is there any possibility to work with S3 via the plugin using “assume role” instead of credentials?

Hi everyone!

As we are using EC2 instances for our Druid cluster, it is not secure to use “accessKey” and “secretKey” to access the S3 bucket.
We have created an IAM policy and role that allow our instances to work with S3 buckets.
So is it possible to work with S3 buckets by providing only the “druid.storage.bucket” parameter, and perhaps the role name if it is required?

Also, it would be cool if someone could share init.d scripts for a Druid cluster on CentOS.

Best regards,
Oleksandr.

Hi Oleksandr,

We ran into the same issue recently when setting it up as well. Druid does support this, but it is currently not documented.

In your _common/common.runtime.properties file, instead of using druid.s3.accessKey and druid.s3.secretKey, use the following:

druid.s3.fileSessionCredentials=<IAM_ROLE_NAME>

where IAM_ROLE_NAME is the name of the role whose credentials Druid will extract from the instance metadata’s IAM map.

Regarding your question, note that we also use druid.storage.baseKey for the base file path within the specified bucket.
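For reference, a minimal sketch of how those properties fit together in _common/common.runtime.properties (the bucket, prefix, and role names are placeholders for your own values):

druid.storage.type=s3
druid.storage.bucket=your-deep-storage-bucket
druid.storage.baseKey=druid/segments
druid.s3.fileSessionCredentials=your-instance-role-name

# No druid.s3.accessKey or druid.s3.secretKey lines are needed when the role is used.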

Thank you for your reply!
Let me clarify a bit more.
So I should just specify:
druid.s3.fileSessionCredentials=<IAM_ROLE_NAME>

Is “<IAM_ROLE_NAME>” an ARN like “arn:aws:iam::account_id:role/role_name”?

The <IAM_ROLE_NAME> is the name without the ARN, so in your example it would simply be role_name :slight_smile:

Many thanks Robert!

It works for me!

Hi Robert!

I am using druid.s3.fileSessionCredentials=<IAM_ROLE_NAME> as you mentioned before.
Everything is fine for the logs; I see them in the S3 bucket.
But now I am trying to load test data, and I am getting this error:
“AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively)”

It seems that Hadoop needs additional configuration. Is it possible to use the role for that too?
Do you have any experience with that?

{
  "type" : "index_hadoop",
  "spec" : {
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "quickstart/wikiticker-2015-09-12-sampled.json.gz"
      }
    },
    "dataSchema" : {
      "dataSource" : "wikiticker",
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2015-09-12/2015-09-13"]
      },
      "parser" : {
        "type" : "hadoopyString",
        "parseSpec" : {
          "format" : "json",
          "dimensionsSpec" : {
            "dimensions" : [
              "channel",
              "cityName",
              "comment",
              "countryIsoCode",
              "countryName",
              "isAnonymous",
              "isMinor",
              "isNew",
              "isRobot",
              "isUnpatrolled",
              "metroCode",
              "namespace",
              "page",
              "regionIsoCode",
              "regionName",
              "user"
            ]
          },
          "timestampSpec" : {
            "format" : "auto",
            "column" : "time"
          }
        }
      },
      "metricsSpec" : [
        {
          "name" : "count",
          "type" : "count"
        },
        {
          "name" : "added",
          "type" : "longSum",
          "fieldName" : "added"
        },
        {
          "name" : "deleted",
          "type" : "longSum",
          "fieldName" : "deleted"
        },
        {
          "name" : "delta",
          "type" : "longSum",
          "fieldName" : "delta"
        },
        {
          "name" : "user_unique",
          "type" : "hyperUnique",
          "fieldName" : "user"
        }
      ]
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 5000000
      },
      "jobProperties" : {}
    }
  }
}

Hi Oleksandr,

Great question! This is something we ran into as well. The root cause is that Druid does not support the s3a protocol, which allows Hadoop to authenticate via the EC2 instance’s metadata.

For now, we are sending the access key and secret key in the indexing request.

E.g.

"tuningConfig" : {
  "type" : "hadoop",
  "jobProperties" : {
    "fs.s3.awsAccessKeyId" : "ACCESS_KEY_ID",
    "fs.s3n.awsAccessKeyId" : "ACCESS_KEY_ID",
    "fs.s3.awsSecretAccessKey" : "SECRET_ACCESS_KEY",
    "fs.s3n.awsSecretAccessKey" : "SECRET_ACCESS_KEY",
    "fs.s3.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
    "fs.s3n.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem"
  }
}


A solution to this (s3a support) is in the works at the moment, and development is quite active.

We are following

https://github.com/druid-io/druid/pull/4116

I hope this will be in 0.10.1, but we shall see.

Also, for security, I recommend you lock down access to port 8090, since your access/secret keys will be sitting out in the open. You should also create an IAM user whose keys can only be used from the IP of the box you're running on, so that if someone does steal them, they only have HTTP access to that box and nothing else in your cloud infrastructure.
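As a rough sketch of that last point (a hypothetical policy, not taken from our setup; the bucket name and IP are placeholders), the IAM user's keys can be pinned to a single source IP with an aws:SourceIp condition:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::your-deep-storage-bucket",
        "arn:aws:s3:::your-deep-storage-bucket/*"
      ],
      "Condition": {
        "IpAddress": { "aws:SourceIp": "203.0.113.10/32" }
      }
    }
  ]
}

With a condition like that, the keys should be unusable from anywhere other than the allowed address, even if they do leak from a task spec.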

Hi Robert!
Thank you for your reply!
It works for me, but now I am getting a strange error:
2017-04-18T14:25:33,246 DEBUG [pool-23-thread-1] org.apache.hadoop.fs.s3native.NativeS3FileSystem - getFileStatus could not find key 'druid/segments/wikiticker-s3-new/2015-09-12T00:00:00.000Z_2015-09-13T00:00:00.000Z/2017-04-18T14:25:09.949Z/0/index.zip.0'
2017-04-18T14:25:33,246 DEBUG [pool-23-thread-1] org.apache.hadoop.fs.s3native.NativeS3FileSystem - Renaming 's3n://BUCKET_NAME/druid/segments/wikiticker-s3-new/2015-09-12T00:00:00.000Z_2015-09-13T00:00:00.000Z/2017-04-18T14:25:09.949Z/0/index.zip.0' to 's3n://BUCKET_NAME/druid/segments/wikiticker-s3-new/2015-09-12T00:00:00.000Z_2015-09-13T00:00:00.000Z/2017-04-18T14:25:09.949Z/0/index.zip' - returning false as src does not exist
2017-04-18T14:25:33,250 INFO [Thread-61] org.apache.hadoop.mapred.LocalJobRunner - reduce task executor complete.
2017-04-18T14:25:33,254 WARN [Thread-61] org.apache.hadoop.mapred.LocalJobRunner - job_local1076469020_0002
java.lang.Exception: java.io.IOException: Unable to rename [s3n://BUCKET_NAME/druid/segments/wikiticker-s3-new/2015-09-12T00:00:00.000Z_2015-09-13T00:00:00.000Z/2017-04-18T14:25:09.949Z/0/index.zip.0] to [s3n://BUCKET_NAME/druid/segments/wikiticker-s3-new/2015-09-12T00:00:00.000Z_2015-09-13T00:00:00.000Z/2017-04-18T14:25:09.949Z/0/index.zip]
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) ~[hadoop-mapreduce-client-common-2.3.0.jar:?]
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529) [hadoop-mapreduce-client-common-2.3.0.jar:?]
Caused by: java.io.IOException: Unable to rename [s3n://BUCKET_NAME/druid/segments/wikiticker-s3-new/2015-09-12T00:00:00.000Z_2015-09-13T00:00:00.000Z/2017-04-18T14:25:09.949Z/0/index.zip.0] to [s3n://BUCKET_NAME/druid/segments/wikiticker-s3-new/2015-09-12T00:00:00.000Z_2015-09-13T00:00:00.000Z/2017-04-18T14:25:09.949Z/0/index.zip]
    at io.druid.indexer.JobHelper.serializeOutIndex(JobHelper.java:452) ~[druid-indexing-hadoop-0.9.2.jar:0.9.2]
    at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:727) ~[druid-indexing-hadoop-0.9.2.jar:0.9.2]
    at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:478) ~[druid-indexing-hadoop-0.9.2.jar:0.9.2]
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171) ~[hadoop-mapreduce-client-core-2.3.0.jar:?]
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627) ~[hadoop-mapreduce-client-core-2.3.0.jar:?]
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389) ~[hadoop-mapreduce-client-core-2.3.0.jar:?]
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319) ~[hadoop-mapreduce-client-common-2.3.0.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:473) ~[?:1.7.0_131]
    at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[?:1.7.0_131]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_131]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_131]
    at java.lang.Thread.run(Thread.java:745) ~[?:1.7.0_131]


The file is actually there.
Have you ever seen anything like that?

{
  "type" : "index_hadoop",
  "spec" : {
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "quickstart/wikiticker-2015-09-12-sampled.json.gz"
      }
    },
    "dataSchema" : {
      "dataSource" : "wikiticker-s3-new",
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2015-09-12/2015-09-13"]
      },
      "parser" : {
        "type" : "hadoopyString",
        "parseSpec" : {
          "format" : "json",
          "dimensionsSpec" : {
            "dimensions" : [
              "channel",
              "cityName",
              "comment",
              "countryIsoCode",
              "countryName",
              "isAnonymous",
              "isMinor",
              "isNew",
              "isRobot",
              "isUnpatrolled",
              "metroCode",
              "namespace",
              "page",
              "regionIsoCode",
              "regionName",
              "user"
            ]
          },
          "timestampSpec" : {
            "format" : "auto",
            "column" : "time"
          }
        }
      },
      "metricsSpec" : [
        {
          "name" : "count",
          "type" : "count"
        },
        {
          "name" : "added",
          "type" : "longSum",
          "fieldName" : "added"
        },
        {
          "name" : "deleted",
          "type" : "longSum",
          "fieldName" : "deleted"
        },
        {
          "name" : "delta",
          "type" : "longSum",
          "fieldName" : "delta"
        },
        {
          "name" : "user_unique",
          "type" : "hyperUnique",
          "fieldName" : "user"
        }
      ]
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 5000000
      },
      "jobProperties" : {
        "fs.s3.awsAccessKeyId" : "***",
        "fs.s3n.awsAccessKeyId" : "***",
        "fs.s3.awsSecretAccessKey" : "***",
        "fs.s3n.awsSecretAccessKey" : "***",
        "fs.s3.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "fs.s3n.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem"
      }
    }
  }
}


Hey Oleksandr,

I haven’t seen anything like that before, but it looks like it’s renaming a file in S3 and then attempting to access it via the pre-rename file name.

In our system we use Spark to upload the data file to S3 before we send the request to index the data, which then pulls the data file down from S3 into Hadoop for indexing. Perhaps it has to do with the fact that it’s a zip? Either way, it would take some playing around to get right. Try opening another question for that, since it appears your initial problem was solved :slight_smile:

Thank you for your reply!
I resolved this problem by updating to druid-0.10.0.

Best regards,
Oleksandr.

Nope, the problem is not resolved yet :frowning:

Now the problem is resolved; incorrect permissions had been set for the user's keys.