Indexing task ingesting data from S3 failing

Hello,

I have the following situation: I have data on S3 in a single directory containing many part files, with one JSON event per line. I want to run a batch indexing job to load the data into Druid, so I created an index_hadoop task. The task appears to successfully read all the data from S3 and process it, but then fails at the very end with an exception. Any suggestions on how to resolve this?
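(For reference, the dataSchema/parser portion of the spec is omitted below; for newline-delimited JSON like mine, a minimal parser block typically looks like the following sketch. The column names here are placeholders, not my actual schema:)

```json
{
  "type": "string",
  "parseSpec": {
    "format": "json",
    "timestampSpec": { "column": "timestamp", "format": "auto" },
    "dimensionsSpec": { "dimensions": [] }
  }
}
```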

Here is the exception from the indexing task:

2016-04-27T23:56:21,905 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[HadoopIndexTask{id=index_hadoop_experiment-events_2016-04-27T23:53:09.664Z, type=index_hadoop, dataSource=experiment-events}]
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
        at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.1.jar:?]
        at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:160) ~[druid-indexing-service-0.9.0.jar:0.9.0]
        at io.druid.indexing.common.task.HadoopIndexTask.run(HadoopIndexTask.java:208) ~[druid-indexing-service-0.9.0.jar:0.9.0]
        at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:338) [druid-indexing-service-0.9.0.jar:0.9.0]
        at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:318) [druid-indexing-service-0.9.0.jar:0.9.0]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_91]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_91]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_91]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_91]
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_91]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_91]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_91]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_91]
        at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:157) ~[druid-indexing-service-0.9.0.jar:0.9.0]
        ... 7 more
Caused by: com.metamx.common.ISE: Job[class io.druid.indexer.IndexGeneratorJob] failed!
        at io.druid.indexer.JobHelper.runJobs(JobHelper.java:343) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]
        at io.druid.indexer.HadoopDruidIndexerJob.run(HadoopDruidIndexerJob.java:94) ~[druid-indexing-hadoop-0.9.0.jar:0.9.0]
        at io.druid.indexing.common.task.HadoopIndexTask$HadoopIndexGeneratorInnerProcessing.runTask(HadoopIndexTask.java:261) ~[druid-indexing-service-0.9.0.jar:0.9.0]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_91]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_91]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_91]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_91]
        at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:157) ~[druid-indexing-service-0.9.0.jar:0.9.0]
        ... 7 more
2016-04-27T23:56:21,916 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_hadoop_experiment-events_2016-04-27T23:53:09.664Z",
  "status" : "FAILED",
  "duration" : 187999
}


I am running Druid 0.9.0 on EC2; I do not have a separate dedicated Hadoop cluster.

Relevant common configuration:

druid.extensions.loadList=["druid-s3-extensions", "mysql-metadata-storage"]

druid.storage.type=s3
druid.storage.bucket=
druid.storage.baseKey=druid/segments
druid.s3.accessKey=
druid.s3.secretKey=

druid.indexer.logs.type=s3

druid.indexer.logs.s3Bucket=
druid.indexer.logs.s3Prefix=druid/indexing-logs



Relevant Overlord config:

druid.indexer.runner.type=remote



Relevant middleManager config:

druid.worker.capacity=3
druid.indexer.runner.javaOpts=-server -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
druid.indexer.task.baseTaskDir=
druid.server.http.numThreads=8
druid.processing.buffer.sizeBytes=256000000
druid.processing.numThreads=2
druid.indexer.task.hadoopWorkingPath=/tmp/druid-indexing
druid.indexer.task.defaultHadoopCoordinates=["org.apache.hadoop:hadoop-client:2.3.0"]


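(Side note, in case it matters: my understanding is that the Hadoop coordinates can also be overridden per task via `hadoopDependencyCoordinates` in the task JSON, rather than only through the middleManager default. A sketch, abbreviated:)

```json
{
  "type": "index_hadoop",
  "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.3.0"],
  "spec": { "...": "..." }
}
```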

Relevant parts of index job config:

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "s3n://<my bucket>/experiment"
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 5000000
      },
      "jobProperties": {
        "fs.s3.awsAccessKeyId": "<my key>",
        "fs.s3.awsSecretAccessKey": "<my secret>",
        "fs.s3.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "fs.s3n.awsAccessKeyId": "<my key>",
        "fs.s3n.awsSecretAccessKey": "<my secret>",
        "fs.s3n.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "io.compression.codecs": "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec"
      }
    }
  }
}
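(In case it's useful: before submitting, I sanity-check the spec JSON locally with a quick script. A minimal sketch; the embedded spec is abbreviated to the parts shown above, and the bucket name is a placeholder:)

```python
import json

# Abbreviated copy of the task spec submitted to the overlord.
spec_text = """
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "s3n://my-bucket/experiment"
      }
    }
  }
}
"""

spec = json.loads(spec_text)

def check_spec(spec):
    """Return a list of structural problems found in an index_hadoop spec."""
    problems = []
    if spec.get("type") != "index_hadoop":
        problems.append("type must be index_hadoop")
    io_config = spec.get("spec", {}).get("ioConfig", {})
    paths = io_config.get("inputSpec", {}).get("paths", "")
    if not paths.startswith(("s3://", "s3n://", "hdfs://")):
        problems.append("inputSpec.paths has an unexpected scheme")
    return problems

# An empty list means no obvious structural issues.
print(check_spec(spec))
```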


Thanks!

Do you have the full task log?