Hadoop Ingestion in Druid 0.15

Hi,
I am ingesting data from HDFS (the tutorial sample provided with Druid 0.15).

I am trying it on HDP 3.1 (HDFS, YARN, and MapReduce).

Attached ingestion spec: wikipedia-index-hadoop.json.

After submitting the task, I am able to see a job getting submitted in YARN. Logs attached (druid_yarn_2.8.3.txt, druid_task_log_2.8.3.txt). I am getting this error:

org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.NumberFormatException: For input string: "30s"

I couldn't find any answers to the issue, so I thought that since I am using HDP 3.1 (Hadoop 3.1), maybe I need to change the Hadoop client.

So I copied the Hadoop client 3.1.1 into the "hadoop-dependencies" folder and also modified the ingestion spec to use 3.1.1. Now I am getting a different error (druid_task_log_3.1.1.txt).
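
For reference, the spec change amounts to roughly this (a sketch, assuming the client jars are unpacked under hadoop-dependencies/hadoop-client/3.1.1/):

    "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:3.1.1"]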

The task is not even getting submitted to YARN.

Can somebody please help?

Thanks in advance.

wikipedia-index-hadoop.json (2.37 KB)

druid_task_log_2.8.3.txt (270 KB)

druid_task_log_3.1.1.txt (223 KB)

druid_yarn_2.8.3.txt (4.46 KB)

Did you check the following?

[2019-07-12 14:07:58.112]Container exited with a non-zero exit code 1. Error file: prelaunch.err.

Last 4096 bytes of prelaunch.err :

Last 4096 bytes of stderr :

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/hadoop/yarn/local/filecache/133/log4j-slf4j-impl-2.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/hadoop/yarn/local/filecache/24/mapreduce.tar.gz/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console.

Eric Graham

Solutions Engineer - Imply

cell: 303-589-4581

email: eric.graham@imply.io

www.imply.io

Hi Lakshmi:

Can you also try adding

"mapreduce.job.user.classpath.first": "true",

"mapreduce.job.classloader.system.classes": "-javax.validation.,java.,javax.,org.apache.commons.logging.,org.apache.log4j.,org.apache.hadoop."

to your tuningConfig -> jobProperties? A sketch of how that section would look is below.
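
For illustration, a minimal sketch of the tuningConfig with those two properties in place (the surrounding fields are placeholders, not taken from your actual spec):

    "tuningConfig": {
      "type": "hadoop",
      "jobProperties": {
        "mapreduce.job.user.classpath.first": "true",
        "mapreduce.job.classloader.system.classes": "-javax.validation.,java.,javax.,org.apache.commons.logging.,org.apache.log4j.,org.apache.hadoop."
      }
    }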

Hi Ming,

Thanks for your response.

The config entry "mapreduce.job.user.classpath.first" was already present. I added the second one you suggested, but still no luck.

Attaching the ingestion spec (wikipedia-index-hadoop.json) and the YARN log (wikipedia-index-hadoop-yarn.log), in case that gives any clue to the issue.

Can you please help? Thanks in advance.

Regards, Chari.

wikipedia-index-hadoop.json (2.5 KB)

wikipedia-index-hadoop-yarn.log (207 KB)

Hi Lakshmi:

Thanks for trying. Can you give us the full ingestion task log to review? Something is being submitted from the Druid side to YARN that YARN does not like.

Hi Ming,

Attached are the payload and log file. I hope this is what you are asking for; otherwise, please let me know how/from where I can extract the information you need.

Thank you.

Regards, Chari.

payload.txt (4.43 KB)

task-log-index_hadoop_wikipedia_hdfs_2019-07-16T11_23_04.028Z.json (314 KB)

Hello Chari,

Thanks for providing the information Ming requested. While he looks at those logs, could you give it one more try by adding the following properties to the jobProperties section? (A sketch of the resulting section follows the list.)

"mapreduce.job.user.classpath.first": "true",

"fs.s3a.readahead.range": "65536",

"fs.s3a.multipart.size": "104857600",

"fs.s3a.block.size": "33554432"
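
For illustration, a minimal sketch of the jobProperties block with those entries added (any properties already in your spec would stay as they are):

    "jobProperties": {
      "mapreduce.job.user.classpath.first": "true",
      "fs.s3a.readahead.range": "65536",
      "fs.s3a.multipart.size": "104857600",
      "fs.s3a.block.size": "33554432"
    }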

Thanks,
Mohan Vedicherla
Customer Success

Hi Mohan,

I tried the configurations you mentioned. I am getting the same error. Attached is the YARN job log.

Also attaching the ingestion JSON spec.

Can you please suggest next steps?

Regards, Chari.

druid_yarn_log_23Jul2019.txt (4.53 KB)

wikipedia-index-hadoop-23Jul2019.json (2.65 KB)

Hi,

Could you check the data you are ingesting?

Is it possible your "delta" field could be expressed as "30s", meaning 30 seconds? If that is the case, it cannot be parsed as a long (which is what your spec expects).
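
For illustration only, a hypothetical input record of that shape (not taken from the actual sample data) which would fail against a spec declaring "delta" as a long:

    {"time": "2015-09-12T00:47:00.496Z", "channel": "#en.wikipedia", "delta": "30s"}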


Hi Guillaume,

The ingestion I am trying is the wiki example that comes with the Druid tutorial, so I don't know if there are any data issues.

To rule out a data problem, I created a simple CSV data set (just 10 rows). Using a native ingestion spec, I could ingest that data successfully.

Then I used the same data with a Hadoop ingestion spec, but I am getting the same "30s" error.

So I don't think it is related to a data issue.

Regards, Chari.

I have the same issue Chari mentioned, and I've tried a lot of things: changing hadoop-client 2.8.3 to 3.2.0, adding the job property "mapreduce.job.classloader.system.classes", and changing the Java version, but nothing has helped even a little. Here are some files related to my Hadoop ingestion task.

hadoop-indexing-logs (276 KB)

Schema (3.45 KB)

Hi there, I was able to successfully ingest the wikiticker JSON file. Here is my ingestion spec, in case it is useful:

{

"type" : "index_hadoop",

"spec" : {

"dataSchema" : {

  "dataSource" : "wikipedia_hdfs",

  "parser" : {

    "type" : "hadoopyString",

    "parseSpec" : {

      "format" : "json",

      "dimensionsSpec" : {

        "dimensions" : [

          "channel",

          "cityName",

          "comment",

          "countryIsoCode",

          "countryName",

          "isAnonymous",

          "isMinor",

          "isNew",

          "isRobot",

          "isUnpatrolled",

          "metroCode",

          "namespace",

          "page",

          "regionIsoCode",

          "regionName",

          "user",

          { "name": "added", "type": "long" },

          { "name": "deleted", "type": "long" },

          { "name": "delta", "type": "long" }

        ]

      },

      "timestampSpec" : {

        "format" : "auto",

        "column" : "time"

      }

    }

  },

  "metricsSpec" : [],

  "granularitySpec" : {

    "type" : "uniform",

    "segmentGranularity" : "day",

    "queryGranularity" : "none",

    "intervals" : ["2015-09-12/2015-09-13"],

    "rollup" : false

  }

},

"ioConfig" : {

  "type" : "hadoop",

  "inputSpec" : {

    "type" : "static",

"paths" : "/user/root/wikiticker-2015-09-12-sampled.json.gz"

  }

},

"tuningConfig" : {

  "type" : "hadoop",

  "partitionsSpec" : {

    "type" : "hashed",

    "targetPartitionSize" : 5000000

  },

  "forceExtendableShardSpecs" : true,

  "jobProperties" : {

    "mapreduce.job.classloader": "true",

    "[io.compression.codecs](http://io.compression.codecs)": "[org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec](http://org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec)",

    "mapreduce.map.java.opts" : "-Duser.timezone=UTC -Dfile.encoding=UTF-8",

    "mapreduce.job.user.classpath.first" : "true",

    "mapreduce.reduce.java.opts" : "-Duser.timezone=UTC -Dfile.encoding=UTF-8",

    "hdp.version": "3.1.0.0-78",

    "mapreduce.job.classloader.system.classes": "-javax.validation.,java.,javax.,[org.apache.commons.logging.,org.apache.log4j.,org.apache.hadoop](http://org.apache.commons.logging.,org.apache.log4j.,org.apache.hadoop)."

  }

}

},

"hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.8.3"]

}

My Druid version was "0.15.0-incubating-iap2", an Imply release, and my Hadoop cluster was Hortonworks Hadoop 3.1.1.3.1.0.0-78. All I did was copy HDP's XML files (core-site.xml, hdfs-site.xml, yarn-site.xml, etc.) to Druid's "_common" directory. That was it.

Hi Ming,

Thanks for the ingestion Spec. It worked for me.

Now the MR job is getting executed; although the job status is "finished" in YARN, it is shown as failed in the Druid console.

After analyzing the logs, I see permission errors on HDFS. I will fix those, re-run the ingestion spec, and update you.

I think the only difference in the spec is "hdp.version": "3.1.0.0-78". I will remove it and run again to see if this is the culprit.

Regards, Chari.

Thanks Ming. I tried, but I am again facing an error. Here are my payload and logs.

task-log-index_hadoop_flight_jan_2019-07-31T09_34_17.432Z.json (15.4 KB)

task-payload-index_hadoop_flight_jan_2019-07-31T09_34_17.432Z.json (6.39 KB)

The task log only shows that a MapReduce job (job_1564564846091_0002) had been launched, without any further details. Were there more logs after that?

I added the property "hdp.version": "3.1.0.0-78" to overcome this Hortonworks bug, and I am not sure if it is related to the original problem. :) https://community.hortonworks.com/questions/74286/hdp-24-mapreduce-error-exception-in-thread-main-ja.html
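
For reference, a sketch of how that property sits in the jobProperties of the spec above; the -Dhdp.version JVM flag on the opts lines is a commonly cited variant of the same workaround and is an assumption on my part, not something verified in this thread:

    "jobProperties": {
      "hdp.version": "3.1.0.0-78",
      "mapreduce.map.java.opts": "-Duser.timezone=UTC -Dfile.encoding=UTF-8 -Dhdp.version=3.1.0.0-78",
      "mapreduce.reduce.java.opts": "-Duser.timezone=UTC -Dfile.encoding=UTF-8 -Dhdp.version=3.1.0.0-78"
    }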

No, Ming.

Maybe check what the YARN application log says about why the job hung?

I think the MapReduce job is being launched but does not proceed any further. I even tried with Hive, but it gives the same logs and gets stuck at job launch. I tuned my mapred and yarn settings in the *-site.xml files, but again no luck.