[Workaround] Druid - Hadoop Indexer - Jackson Conflict

An annoying exception you may run across is

Error: class com.fasterxml.jackson.datatype.guava.deser.HostAndPortDeserializer overrides final method deserialize
...

Issue 2788 and some other issues report this too.

It is a real problem, because you need the Hadoop indexer for offline ingestion, typically to backfill missing data. After many searches and blind trials, I finally decided to resolve it by recompiling.

Root Cause
Jackson 1.x -> 2.x library incompatibility. Hadoop depends on the older Jackson artifacts, and since they are pulled in at runtime you cannot simply exclude them.
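
To see the clash on a concrete installation, you can list the Jackson jars each side brings along. This is only a rough sketch; the HADOOP_HOME and DRUID_HOME variables and the lib paths are assumptions for a typical layout:

# Jackson jars Hadoop puts on the job classpath (the old ones you cannot exclude)
ls $HADOOP_HOME/share/hadoop/common/lib/ | grep -i jackson
# Jackson jars the Druid indexer itself ships and needs (newer 2.x, including jackson-datatype-guava)
ls $DRUID_HOME/lib/ | grep -i jackson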

Solution
Downgrading Jackson introduces new problems: Druid 0.9.x fails to read JSON config files if you switch to Jackson 2.3.5.
Shading is the straightforward way to solve it; you can proceed as follows.

  • Add the extensions you need to services/pom.xml, for example:
<dependency>
     <groupId>io.druid.extensions</groupId>
     <artifactId>druid-avro-extensions</artifactId>
     <version>${project.parent.version}</version>
 </dependency>

 <dependency>
     <groupId>io.druid.extensions.contrib</groupId>
     <artifactId>druid-parquet-extensions</artifactId>
     <version>${project.parent.version}</version>
 </dependency>

 <dependency>
     <groupId>io.druid.extensions</groupId>
     <artifactId>druid-hdfs-storage</artifactId>
     <version>${project.parent.version}</version>
 </dependency>

 <dependency>
     <groupId>io.druid.extensions</groupId>
     <artifactId>mysql-metadata-storage</artifactId>
     <version>${project.parent.version}</version>
 </dependency>
  • Shade the Jackson packages and assemble everything into one big jar:
<plugin>
     <groupId>org.apache.maven.plugins</groupId>
     <artifactId>maven-shade-plugin</artifactId>
     <executions>
         <execution>
             <phase>package</phase>
             <goals>
                 <goal>shade</goal>
             </goals>
             <configuration>
                 <outputFile>
                     ${project.build.directory}/${project.artifactId}-${project.version}-selfcontained.jar
                 </outputFile>
                 <relocations>
                     <relocation>
                         <pattern>com.fasterxml.jackson</pattern>
                         <shadedPattern>shade.com.fasterxml.jackson</shadedPattern>
                     </relocation>
                 </relocations>
                 <artifactSet>
                     <includes>
                         <include>*:*</include>
                     </includes>
                 </artifactSet>
                 <filters>
                     <filter>
                         <artifact>*:*</artifact>
                         <excludes>
                             <exclude>META-INF/*.SF</exclude>
                             <exclude>META-INF/*.DSA</exclude>
                             <exclude>META-INF/*.RSA</exclude>
                         </excludes>
                     </filter>
                 </filters>
                 <transformers>
                     <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                 </transformers>
             </configuration>
         </execution>
     </executions>
 </plugin>

Copy services/target/xxxxx-selfcontained.jar out after mvn install for further usage.
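
A sketch of the build-and-check round trip (the skipTests flag, the destination directory, and the exact jar name are assumptions; use whatever actually shows up under services/target):

# build the tree so the shade plugin produces the self-contained jar
mvn install -DskipTests
# sanity-check the relocation: Jackson classes should now sit under shade/com/fasterxml/jackson/...
jar tf services/target/*-selfcontained.jar | grep HostAndPortDeserializer
# keep the fat jar somewhere handy for the indexer command below
cp services/target/*-selfcontained.jar /opt/druid-tools/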

  • Use a minimal pom.xml to run mvn dependency:copy-dependencies and replace the hadoop-client libraries provided in the Druid distribution.
  • Finally, run the Hadoop indexer as below; lib is not needed anymore, and you don't need to restart your running services:
java -Xmx32m \
  -Dfile.encoding=UTF-8 -Duser.timezone=UTC \
  -classpath config/hadoop:config/overlord:config/_common:$SELF_CONTAINED_JAR:$HADOOP_DISTRIBUTION/etc/hadoop \
  -Djava.security.krb5.conf=$KRB5 \
  io.druid.cli.Main index hadoop \
  $config_path
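
The classpath entries and variables above are placeholders; a hedged example of what they might point at (all values below are assumptions):

# the shaded jar copied out of services/target
SELF_CONTAINED_JAR=/opt/druid-tools/druid-0.9.0-selfcontained.jar
# Hadoop client install whose etc/hadoop holds your cluster's site configs
HADOOP_DISTRIBUTION=/usr/lib/hadoop
# Kerberos config, only needed on secured clusters
KRB5=/etc/krb5.conf
# the Hadoop batch ingestion spec to submit
config_path=quickstart/hadoop_index_task.json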

Hi,
Thanks for the neat write-up.

It would be great if you could also submit a PR to add it to druid docs.

It’s my pleasure.

On Monday, April 11, 2016 at 3:57:08 PM UTC+8, Nishant Bangarwa wrote:

PR #2817

On Monday, April 11, 2016 at 3:57:08 PM UTC+8, Nishant Bangarwa wrote:

Hi,

I had the fat jar created as services/target/druid-0.9.0-selfcontained.jar, but it doesn't contain everything from the Druid libraries. So I'm wondering, what does "Use a minimal pom.xml to run mvn dependency:copy-dependencies" mean? Thank you so much!

Sorry, I forgot to reply. A minimal pom.xml contains only the hadoop-client of your distribution, such as:

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>xx</groupId>
    <artifactId>xx</artifactId>
    <version>0.0.1-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>YOUR VERSION</version>
        </dependency>
    </dependencies>
</project>

Then running mvn dependency:copy-dependencies will generate your own hadoop dependency libraries in the folder target/dependency. Next, delete all of the jars in lib/ and put the self-contained jar there. Also empty hadoop-dependencies/hadoop-client/2.3.0 in the downloaded Druid distribution and place the jars from target/dependency in it; a sketch of the whole swap is below. That's all. Sorry again for the late reply.
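
Put together, the swap described above looks roughly like this (DRUID_HOME and the jar path are assumptions):

# run in the directory holding the minimal pom.xml
mvn dependency:copy-dependencies
# replace the stock Druid jars with the self-contained one
rm $DRUID_HOME/lib/*.jar
cp /path/to/druid-0.9.0-selfcontained.jar $DRUID_HOME/lib/
# empty the bundled hadoop-client jars and drop in your own distribution's dependencies
rm $DRUID_HOME/hadoop-dependencies/hadoop-client/2.3.0/*.jar
cp target/dependency/*.jar $DRUID_HOME/hadoop-dependencies/hadoop-client/2.3.0/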

Hi

I'm failing at the step below: unable to find or load main class io.druid.cli.Main.

java -Xmx32m \
  -Dfile.encoding=UTF-8 -Duser.timezone=UTC \
  -classpath config/hadoop:config/overlord:config/_common:$SELF_CONTAINED_JAR:$HADOOP_DISTRIBUTION/etc/hadoop \
  -Djava.security.krb5.conf=$KRB5 \
  io.druid.cli.Main index hadoop \
  $config_path

Hi,

I was able to create the fat jar and start all the nodes, including the indexer. But while running my index_task.json, it fails with the following error. Could you please give me some input?
f45c899591af:druid-0.9.0 anindit$ curl -X 'POST' -H 'Content-Type:application/json' -Dhadoop.user.name=hadoop -Dhadoop.hadoop.rpc.socket.factory.class.default=org.apache.hadoop.net.SocksSocketFactory -Dhadoop.hadoop.socks.server=localhost:8157 -d @quickstart/ad3_index_task.json localhost:8090/druid/indexer/v1/task

Error 500

HTTP ERROR: 500

Problem accessing /druid/indexer/v1/task. Reason:

    java.lang.NoSuchMethodError: javax.servlet.http.HttpServletResponse.getHeader(Ljava/lang/String;)Ljava/lang/String;

Powered by Jetty://

I just use it for offline Hadoop indexing from the command line; indexing through a POST is not tested.

On Tuesday, April 26, 2016 at 10:19:05 AM UTC+8, anindita dey wrote:

Sorry if I am hijacking this thread, but after countless attempts to solve this issue, I've tried setting

"mapreduce.job.user.classpath.first": "true"
in my spec file, but I get this error when running the job:

Diagnostics: Exception from container-launch.
Container id: container_e02_1461544451524_0047_05_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
at org.apache.hadoop.util.Shell.run(Shell.java:487)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:371)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:303)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Shell output: main : command provided 1
main : run as user is nobody
main : requested yarn user is druid

I have no leads to go on; any help is appreciated.
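
For reference, the usual home for that flag is the jobProperties block inside the hadoop tuningConfig of the ingestion spec; a minimal sketch of that fragment (the rest of the spec is omitted, and the scratch file name is only illustrative):

# fragment of the ingestion spec, written to a scratch file just to show where the flag lives
cat > tuning_fragment.json <<'EOF'
  "tuningConfig": {
    "type": "hadoop",
    "jobProperties": {
      "mapreduce.job.user.classpath.first": "true"
    }
  }
EOF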

Thanks for the reply. Could you please give an example of how you are running your jobs? Also, do you think adding the Servlet and HTTP jars to the fat jar would allow me to perform the indexing through a POST?

Hi,

It would be really helpful if you could post the full dependency dump.

thanks a lot.

I had set up a Druid cluster running normally before I realized the problems with Hadoop indexing, and I just wanted a standalone way to do ingestion alongside it. My fat-jar approach is meant for Hadoop indexing only.

That output alone is not very helpful; you should look into more of the related logs before the key problem can be located.

On Tuesday, April 26, 2016 at 11:58:16 PM UTC+8, Stelios Savva wrote:

  1. Your full hadoop-client dependency set; this can be pulled with a pom that contains only your distribution's hadoop-client (see the sketch after this list). It all depends on your distribution, so be careful to choose the one that matches your Hadoop cluster.

  2. Everything except the first has been posted above; I don't know where you got stuck.
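
A sketch of pulling that dependency list from the minimal pom (the output file name is an assumption):

# run next to the minimal pom.xml that declares only your distribution's hadoop-client
mvn dependency:tree -DoutputFile=hadoop-client-deps.txt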

You can refer to this page for more information.

On Wednesday, April 27, 2016 at 4:10:27 AM UTC+8, anindita dey wrote:

Thanks Ninglin!

I saw your write-up. It was very helpful; thank you very much for that.

One line, quoted below, points to a page that now throws a "page not found" error:
"(3) Cd to 'druid_build' and create the build.sbt file with the content here."

But other than that, I was able to get the fat jar running for offline Hadoop indexing.

Hi Stelios Savva,
were you able to solve this issue?