Use AWS EC2 role for access to S3 on Hadoop Ingestion Task

Hi,

The Druid cluster we have set up ingests files from S3. The EC2 instances running the MiddleManagers have been assigned AWS IAM roles that provide access to S3. We do not supply the AWS credentials in Druid's common.runtime.properties file.

This works fine for native ingestion tasks. The MiddleManager is able to pull files from S3 and ingest them.

But when it comes to Hadoop ingestion tasks, Druid requires us to provide the AWS credentials in the ingestion template (the **jobProperties** section under tuningConfig). We tried omitting the AWS credentials and also tried providing incorrect values, but neither works.

Note: We are not running the hadoop ingestion tasks against a remote hadoop cluster.

```json
"fs.s3n.awsAccessKeyId": "AWS_ACCESS_KEY_ID",
"fs.s3n.awsSecretAccessKey": "AWS_SECRET_ACCESS_KEY"
```

Is there a way to skip providing the AWS secret/key properties in the ingestion template and make the Hadoop ingestion task use the instance's AWS role to access the S3 files?

Regards,

Vinay

Hi Vinay,
Does it work if you put the AWS credentials in common.runtime.properties instead of the ingestion spec?

Thanks,

–siva

Hi Siva,

The Hadoop ingestion task fails if it is submitted without AWS credentials.

Regards,
Vinay

Hi Vinay,
I went through a few docs related to index_hadoop plus S3. Pretty much every doc suggests providing the AWS credentials in the ingestion spec itself.

To me, it looks like a feature request might be needed if you don’t want to give AWS credentials in the Hadoop ingestion spec of Druid.

I will keep looking for an alternative, but whatever docs I looked at suggest giving the AWS credentials in the ingestion spec itself.

Is it for security reasons that you don’t want to give AWS credentials in the ingestion spec?

Thanks,

–siva

Hi Vinay,
I also want to confirm: did you put the AWS credentials in the common.runtime.properties file on all the nodes of your cluster?

Then you restarted your Druid cluster, retried the Hadoop ingestion from S3, and it failed with something like invalid credentials.

Am I right?

Thanks,

–siva

Hi Vinay,

The answer depends on where your Hadoop cluster is located.

If it is outside AWS, then no, you can’t avoid providing credentials in the ingestion spec.

If it is in AWS, then yes, there is another way (I went through this myself, as I didn’t want to pass my credentials in the clear either).

Instead, you can attach a role to your Hadoop instances (EMR or EC2 instances) so that they have access to both your source bucket and your deep storage.

Here is a topic where I explained how to do it : https://groups.google.com/forum/m/#!search/%5Bdruid-user%5D$20Druid$200.14.1$20-$20Map$2FReduce$20indexing$20task$20fail$20$20AWS$20Signature$20Version$204/druid-user/CifknAFoOGc

Guillaume

Not sure the link will work (I'm on my phone, not so practical).
So you can also search for “Druid 0.14.1 - Map/Reduce indexing task fail AWS Signature Version 4” in the druid-user group topics.

If your whole ecosystem is on AWS, then the instances running your Hadoop (I assume that, at the bare-metal level, it runs on EC2 instances) should have an IAM role with the required permissions attached.

If Hadoop is external, then it’s more of a Hadoop-related question than a Druid one.

Hi Siva,

Storing AWS credentials in the common.runtime.properties file is something we don't want to do. So we have attached a role with the necessary permissions to the EC2 instances instead.

Regards,
Vinay

Hi Karthik,

Our Druid cluster runs on AWS. We are not using an external Hadoop cluster to run the Hadoop-based ingestion tasks. The EC2 instances the MiddleManagers run on have a role attached with the necessary permissions to the S3 buckets. But Druid does not allow us to submit Hadoop ingestion tasks without AWS credentials in the template. If we skip the credentials or provide incorrect values, the task fails.

Regards,
Vinay

Hi Karthik,

Can you attach the ingestion spec?

Is there any specific reason why you want to use a Hadoop ingestion task instead of native ingestion?

Thanks

Parquet/Avro file ingestion requires a Hadoop ingestion task. These formats cannot be ingested using native tasks.
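For reference, a minimal sketch of the extension loading this relies on in common.runtime.properties (assuming the standard extension names from the Druid docs; the Parquet extension depends on the Avro extension):

```
# Sketch only: assumes the standard extension names; adjust to your deployment
druid.extensions.loadList=["druid-s3-extensions", "druid-avro-extensions", "druid-parquet-extensions"]
```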

Regards,

Vinay

Hi Vinay ,

It should allow that. Would you mind attaching the ingestion task JSON?

Thanks

Hi Thomas,

There is a thread where it is mentioned that Parquet file ingestion needs a Hadoop ingestion task: https://groups.google.com/forum/m/#!topic/druid-user/0ubPu-i7kmM

Also, the documentation says the same: https://druid.apache.org/docs/latest/development/extensions-core/parquet.html

Here’s my Hadoop ingestion task template. It works fine if the AWS credentials are provided in jobProperties.

```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "FILE-LOCATION"
      }
    },
    "dataSchema": {
      "dataSource": "parquet_data_ingestion_test",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "date",
            "format": "yyyy-MM-dd"
          },
          "dimensionsSpec": {
            "dimensions": [
              "platform",
              "manufacturer",
              "browser"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [
        {
          "type": "longSum",
          "name": "time_spent",
          "fieldName": "time_spent"
        }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "DAY",
        "intervals": [
          "INTERVAL"
        ]
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "targetPartitionSize": 5000000
      },
      "jobProperties": {
        "mapreduce.map.java.opts": "-server -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps",
        "mapreduce.reduce.java.opts": "-server -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps",
        "mapred.child.java.opts": "-server -XX:+PrintGCDetails -XX:+PrintGCTimeStamps",
        "io.compression.codecs": "org.apache.hadoop.io.compress.GzipCodec",
        "fs.s3n.awsAccessKeyId": "AWS_ACCESS_KEY_ID",
        "fs.s3n.awsSecretAccessKey": "AWS_SECRET_ACCESS_KEY",
        "fs.s3n.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem"
      },
      "leaveIntermediate": true
    }
  }
}
```

Hi,

I confirm that Parquet files cannot be ingested from S3 with native tasks.

Also, to get rid of the credentials in your ingestion spec, you need to use s3a instead of s3n and do the following.

Set this property in your Druid config:

```
druid.storage.useS3aSchema=true
```

Then your spec can contain:

```json
"fs.s3a.awsAccessKeyId": "accesskey",
"fs.s3a.awsSecretAccessKey": "secretkey",
"fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
"fs.s3a.server-side-encryption-algorithm": "AES256",
"fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
```

fs.s3.impl is still needed to make it work.

Also, an even more secure way is to pass the spec without any credentials in it:

```json
"fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
"fs.s3a.aws.credentials.provider": "com.amazonaws.auth.InstanceProfileCredentialsProvider",
"fs.s3a.server-side-encryption-algorithm": "AES256",
"fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
```

It requires you to grant your EMR instance role the policies to access your S3 buckets (source bucket AND Druid deep storage, if applicable) plus kms:Decrypt on your KMS key.
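For illustration, a minimal sketch of such an instance-role policy (the bucket names and KMS key ARN are placeholders, not values from this thread; the exact actions depend on what the role has to do, e.g. deep storage also needs write access):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "SourceAndDeepStorageAccess",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-source-bucket",
        "arn:aws:s3:::my-source-bucket/*",
        "arn:aws:s3:::my-druid-deep-storage",
        "arn:aws:s3:::my-druid-deep-storage/*"
      ]
    },
    {
      "Sid": "DecryptSseKmsObjects",
      "Effect": "Allow",
      "Action": ["kms:Decrypt"],
      "Resource": "arn:aws:kms:us-east-1:111122223333:key/example-key-id"
    }
  ]
}
```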

If you use an instance profile role, you can also omit the credentials.provider property, as long as you don’t provide any other credential properties (InstanceProfileCredentialsProvider is the last authentication method checked; see https://hadoop.apache.org/docs/r2.8.3/hadoop-project-dist/hadoop-common/core-default.xml for more details).
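In that case, a minimal sketch of the jobProperties (assuming the instance profile is the only credential source available, so S3A falls through its default provider chain to it; keep the SSE property only if your objects are actually encrypted that way):

```json
"jobProperties": {
  "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
  "fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
  "fs.s3a.server-side-encryption-algorithm": "AES256"
}
```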

This works perfectly

Guillaume

Hi Vinay,

I got your point. You don’t want to keep the S3 credentials anywhere in the config files. The Hadoop ingestion task looks at the scheme of the input paths (like hdfs:// or s3://) to decide how to talk to the file system, which is why it looks for the key details.
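For example (the bucket and path below are placeholders), with an s3a:// path in the inputSpec, Hadoop resolves the file system from the matching fs.s3a.* properties and their credential provider:

```json
"inputSpec": {
  "type": "static",
  "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
  "paths": "s3a://my-source-bucket/path/to/parquet/"
}
```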

I feel the approach Guillaume mentioned in his previous mail is a good option. Please see if it works for you.

Let me also explore whether any other options are possible to make use of the EC2 role from the Hadoop ingestion task.

Thanks