[druid-user] hadoop aws embedded libraries not working as expected for ingestion

Hi,

I’m currently working on upgrading our clusters from 0.15.1-incubating to 0.22.1.
After reading all the changelogs to identify potential issues and concluding that there were none, I tried to update a test cluster.
The rolling update was done following the recommendations: http://druid.io/docs/0.9.0-rc1/operations/rolling-updates.html

And everything went well. I can still query my data, and all nodes are up (after fixing a few things, but just minor fixes).

Honestly, that was very nice.

The last part I had to test is my ingestion tasks.

All my ingestion tasks are based on the same pattern:

  • Hadoop ingestion, where the sources are Parquet files stored in an AWS S3 bucket (a rough ioConfig sketch follows below).
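
To make that concrete, the relevant part of the specs looks roughly like this (the bucket and paths are made up for the example; the inputFormat class is the one documented for druid-parquet-extensions):

"ioConfig": {
  "type": "hadoop",
  "inputSpec": {
    "type": "static",
    "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
    "paths": "s3a://example-bucket/path/to/parquet/"
  }
}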

Previously (in 0.15.1), to make that ingestion work, I had to manually add the AWS Hadoop dependencies on my MiddleManager nodes.
While reading the changelogs, I noticed this point in the 0.18 release: https://github.com/apache/druid/releases#upgrade-hadoop-aws
Nice, I can stop adding the Hadoop dependencies manually and use the out-of-the-box ones.
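
For context, "out-of-the-box" here means the bundled client libs that Druid selects via the property below in the common runtime config (I believe 2.8.5 is still the default in 0.22.1, so treat the exact version as my assumption):

# common.runtime.properties
druid.indexer.task.defaultHadoopCoordinates=["org.apache.hadoop:hadoop-client:2.8.5"]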

However, this is not working as intended.
With the exact same ingestion task, I ran into this error:

2022-01-18T18:05:03,859 INFO [task-runner-0-priority-0] org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at emr-cluster/172.30.xx.xx:8032
2022-01-18T18:05:04,364 WARN [task-runner-0-priority-0] org.apache.hadoop.mapreduce.JobResourceUploader - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2022-01-18T18:05:04,374 WARN [task-runner-0-priority-0] org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2022-01-18T18:05:04,759 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.JobSubmitter - Cleaning up the staging area /tmp/hadoop-yarn/staging/druid/.staging/job_1640710029235_6755
2022-01-18T18:05:04,763 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.common.task.HadoopIndexTask - Encountered exception in HadoopIndexGeneratorInnerProcessing.
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
	at org.apache.druid.indexer.IndexGeneratorJob.run(IndexGeneratorJob.java:242) ~[druid-indexing-hadoop-0.22.1.jar:0.22.1]
	at org.apache.druid.indexer.JobHelper.runJobs(JobHelper.java:399) ~[druid-indexing-hadoop-0.22.1.jar:0.22.1]
	at org.apache.druid.indexer.HadoopDruidIndexerJob.run(HadoopDruidIndexerJob.java:100) ~[druid-indexing-hadoop-0.22.1.jar:0.22.1]
	at org.apache.druid.indexing.common.task.HadoopIndexTask$HadoopIndexGeneratorInnerProcessingRunner.runTask(HadoopIndexTask.java:834) [druid-indexing-service-0.22.1.jar:0.22.1]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_312]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_312]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_312]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_312]
	at org.apache.druid.indexing.common.task.HadoopIndexTask.runInternal(HadoopIndexTask.java:442) [druid-indexing-service-0.22.1.jar:0.22.1]
	at org.apache.druid.indexing.common.task.HadoopIndexTask.runTask(HadoopIndexTask.java:284) [druid-indexing-service-0.22.1.jar:0.22.1]
	at org.apache.druid.indexing.common.task.AbstractBatchIndexTask.run(AbstractBatchIndexTask.java:159) [druid-indexing-service-0.22.1.jar:0.22.1]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:471) [druid-indexing-service-0.22.1.jar:0.22.1]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:443) [druid-indexing-service-0.22.1.jar:0.22.1]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_312]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_312]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_312]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_312]

Oh. Not nice at all.
The S3AFileSystem class should be located in a hadoop-aws-*.jar, according to https://hadoop.apache.org/docs/r3.1.2/hadoop-aws/tools/hadoop-aws/troubleshooting_s3a.html#ClassNotFoundException:_org.apache.hadoop.fs.s3a.S3AFileSystem.

When looking into my Druid node, I can find this class only here:

sh-4.2$ find /opt/druid-0.22.1 -name '*.jar' -print | while read i; do unzip -l "$i" | grep -Hsi S3AFileSystem && echo "$i"; done
(standard input): 716 09-10-2018 11:56 org/apache/hadoop/fs/s3a/S3AFileSystem$3.class
(standard input): 737 09-10-2018 11:56 org/apache/hadoop/fs/s3a/S3AFileSystem$4.class
(standard input): 8140 09-10-2018 11:56 org/apache/hadoop/fs/s3a/S3AFileSystem$WriteOperationHelper.class
(standard input): 960 09-10-2018 11:56 org/apache/hadoop/fs/s3a/S3AFileSystem$1.class
(standard input): 54631 09-10-2018 11:56 org/apache/hadoop/fs/s3a/S3AFileSystem.class
(standard input): 1238 09-10-2018 11:56 org/apache/hadoop/fs/s3a/S3AFileSystem$2.class
/opt/druid-0.22.1/extensions/druid-hdfs-storage/hadoop-aws-2.8.5.jar

The only hadoop-aws jar is located under the druid-hdfs-storage extension folder, and I am not using that extension.
Nothing under the hadoop-dependencies folder.

Did I misunderstand the changelog message? Does it apply only to the druid-hdfs-storage extension?
I thought I could also use the AWS Hadoop libs for my ingestion tasks.

NB: I know I could now use native ingestion, which is capable of reading Parquet files (this was not the case back in 0.15.1).
But we can't work on that for the moment.

Any hint will be appreciated :slight_smile:

So I don't profess to know the answer here, I'm afraid (!), but I was aware that this doc exists, hidden away in the docs, which I wonder might help a little?

Oh, I remember this one :slight_smile:
About those tips:

#1: Place Hadoop XMLs on Druid classpath
Already done :slight_smile:
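
For anyone following along, this means copying the cluster's Hadoop XMLs into Druid's common config directory, along these lines (paths are illustrative for our layout):

# Put the Hadoop client configs on the Druid classpath
cp /etc/hadoop/conf/core-site.xml \
   /etc/hadoop/conf/hdfs-site.xml \
   /etc/hadoop/conf/yarn-site.xml \
   /etc/hadoop/conf/mapred-site.xml \
   /opt/druid-0.22.1/conf/druid/cluster/_common/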

#2: Classloader modification on Hadoop (Map/Reduce jobs only)
I'll give this one a try, but I expected not to have to change any ingestion specs; I wanted the upgrade to be transparent for our customers.
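
If I read the doc correctly, this tip boils down to adding something like the following to each spec's tuningConfig (a sketch based on the docs, not our actual spec):

"tuningConfig": {
  "type": "hadoop",
  "jobProperties": {
    "mapreduce.job.classloader": "true"
  }
}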
I'm still wondering why the hadoop-aws jar is not present under the hadoop-dependencies folder despite the changelog.

Am I misreading it?

If you can get any information about that point, it would be really nice :slight_smile:

#3: Use specific versions of Hadoop libraries

That’s exactly the point I’m trying to avoid :slight_smile:
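
For completeness, that tip would mean pinning the coordinates per task, assuming the matching jars sit under the hadoop-dependencies tree:

"hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.8.5"]

with the version being whatever the target cluster actually runs.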

I'll let you know if tip #2 fixes it.

Well, no changes :frowning:

After reflection, it seems logical.
I think the error happens before anything is even sent to my EMR cluster (I don't see any trace of a job on it).

All the logs come from the MiddleManager (task runner logs), so it looks like something is messing up the classpath on the task side rather than on the remote cluster.

I'll try to manually add the missing jar to check whether that works.
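
Concretely, I mean something like this (just a sketch; whether the hadoop-dependencies tree is the right landing spot is exactly what I want to check, and the 2.8.5 paths are simply what my install shows):

# Reuse the hadoop-aws jar shipped with the druid-hdfs-storage extension
cp /opt/druid-0.22.1/extensions/druid-hdfs-storage/hadoop-aws-2.8.5.jar \
   /opt/druid-0.22.1/hadoop-dependencies/hadoop-client/2.8.5/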

However, starting from here, I'm not sure even a native ingestion task would work with the OOTB Hadoop libs.

I’m really feeling like I’m missing something but can’t figure out what.

argh,

I tried to re-add my EMR jars as I did for previous versions, but now I'm hitting errors like this:

java.lang.NoSuchMethodError: org.apache.hadoop.mapreduce.Job.getArchiveSharedCacheUploadPolicies(Lorg/apache/hadoop/conf/Configuration;)Ljava/util/Map;
	at org.apache.hadoop.mapreduce.v2.util.MRApps.setupDistributedCache(MRApps.java:520) ~[?:?]
	at org.apache.hadoop.mapred.YARNRunner.setupContainerLaunchContextForAM(YARNRunner.java:544) ~[?:?]
	at org.apache.hadoop.mapred.YARNRunner.createApplicationSubmissionContext(YARNRunner.java:582) ~[?:?]
	at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:325) ~[?:?]
	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:242) ~[?:?]
	at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341) ~[?:?]
	at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1338) ~[?:?]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_312]
	at javax.security.auth.Subject.doAs(Subject.java:422) ~[?:1.8.0_312]
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844) ~[?:?]
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:1338) ~[?:?]
	at org.apache.druid.indexer.IndexGeneratorJob.run(IndexGeneratorJob.java:207) ~[druid-indexing-hadoop-0.22.1.jar:0.22.1]
	at org.apache.druid.indexer.JobHelper.runJobs(JobHelper.java:399) ~[druid-indexing-hadoop-0.22.1.jar:0.22.1]
	at org.apache.druid.indexer.HadoopDruidIndexerJob.run(HadoopDruidIndexerJob.java:100) ~[druid-indexing-hadoop-0.22.1.jar:0.22.1]
	at org.apache.druid.indexing.common.task.HadoopIndexTask$HadoopIndexGeneratorInnerProcessingRunner.runTask(HadoopIndexTask.java:834) ~[druid-indexing-service-0.22.1.jar:0.22.1]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_312]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_312]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_312]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_312]
	at org.apache.druid.indexing.common.task.HadoopIndexTask.runInternal(HadoopIndexTask.java:442) ~[druid-indexing-service-0.22.1.jar:0.22.1]
	at org.apache.druid.indexing.common.task.HadoopIndexTask.runTask(HadoopIndexTask.java:284) [druid-indexing-service-0.22.1.jar:0.22.1]
	at org.apache.druid.indexing.common.task.AbstractBatchIndexTask.run(AbstractBatchIndexTask.java:159) [druid-indexing-service-0.22.1.jar:0.22.1]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:471) [druid-indexing-service-0.22.1.jar:0.22.1]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:443) [druid-indexing-service-0.22.1.jar:0.22.1]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_312]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_312]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_312]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_312]

which seems related to version conflicts.
For this specific case, I spotted the parquet extension shipping a hadoop-mapreduce-client-core jar that conflicts with the one I provide.
I tried removing the extension's jar, and I just hit the same error on another class.
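
In case it helps anyone reproduce, this is roughly how I'm hunting for the duplicated artifacts (a quick loop, nothing official):

# Find every copy of a given Hadoop artifact across the Druid install
find /opt/druid-0.22.1 -name 'hadoop-mapreduce-client-core*.jar'
# ...and list which jars bundle a suspect class
find /opt/druid-0.22.1 -name '*.jar' | while read i; do unzip -l "$i" | grep -qs 'mapreduce/Job.class' && echo "$i"; done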

Has anyone run Hadoop ingestion tasks on Parquet files from S3 with the latest Druid version?
Did everything work? I'm really interested in any feedback.

