Batch Ingestion fails due to unknown error

Hi,

I am using Druid to ingest 1TB worth of files from S3, running a local single-machine cluster on a c4.4xlarge instance. The failure happens after about 1,400 files have been ingested, and no more files are ingested from that point on. The error doesn't seem to indicate what is wrong:

```
2017-04-16T20:09:35,104 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[HadoopIndexTask{id=index_hadoop_searches_2017-04-14T02:26:05.089Z, type=index_hadoop, dataSource=searches}]
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
    at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.1.jar:?]
    at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:204) ~[druid-indexing-service-0.9.2.jar:0.9.2]
    at io.druid.indexing.common.task.HadoopIndexTask.run(HadoopIndexTask.java:208) ~[druid-indexing-service-0.9.2.jar:0.9.2]
    at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436) [druid-indexing-service-0.9.2.jar:0.9.2]
    at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408) [druid-indexing-service-0.9.2.jar:0.9.2]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_05]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_05]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_05]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_05]
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_05]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_05]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_05]
    at java.lang.reflect.Method.invoke(Method.java:483) ~[?:1.8.0_05]
    at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:201) ~[druid-indexing-service-0.9.2.jar:0.9.2]
    ... 7 more
Caused by: com.metamx.common.ISE: Job[class io.druid.indexer.IndexGeneratorJob] failed!
    at io.druid.indexer.JobHelper.runJobs(JobHelper.java:369) ~[druid-indexing-hadoop-0.9.2.jar:0.9.2]
    at io.druid.indexer.HadoopDruidIndexerJob.run(HadoopDruidIndexerJob.java:94) ~[druid-indexing-hadoop-0.9.2.jar:0.9.2]
    at io.druid.indexing.common.task.HadoopIndexTask$HadoopIndexGeneratorInnerProcessing.runTask(HadoopIndexTask.java:261) ~[druid-indexing-service-0.9.2.jar:0.9.2]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_05]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_05]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_05]
    at java.lang.reflect.Method.invoke(Method.java:483) ~[?:1.8.0_05]
    at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:201) ~[druid-indexing-service-0.9.2.jar:0.9.2]
    ... 7 more
2017-04-16T20:10:02,611 INFO [task-runner-0-priority-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_hadoop_searches_2017-04-14T02:26:05.089Z] status changed to [FAILED].
2017-04-16T20:10:39,674 WARN [Curator-Framework-0] org.apache.curator.ConnectionState - Connection attempt unsuccessful after 232110 (greater than max timeout of 30000). Resetting connection and trying again with a new connection.
2017-04-16T20:10:40,232 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_hadoop_searches_2017-04-14T02:26:05.089Z",
  "status" : "FAILED",
  "duration" : 236634704
}
```

Settings-wise, I am using the recommended configuration files, with the heap sizes reduced slightly so that the total memory consumed stays within the capacity of the machine.
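
For reference, the heap reduction amounts to lowering -Xmx in the per-process jvm.config files and in the peon javaOpts; it looks roughly like the following (the exact values here are illustrative, not the precise ones in use):

```
# conf/druid/middleManager/runtime.properties -- heap for the indexing task (peon)
druid.indexer.runner.javaOpts=-server -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=100000000

# conf/druid/historical/jvm.config (similar reductions for the other services)
-Xms4g
-Xmx4g
```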

I am currently using the following spec to ingest the files from S3, and have verified that it works when the path is changed to point at a single file in S3. However, when trying to ingest the full 1TB worth of files (10,000+ files in total), I get the error above.

```
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "s3n://<ACCESS_KEY>:<SECRET_ACCESS_KEY>@//dt=2017-04-04/*"
      }
    },
    "dataSchema": {
      "dataSource": "searches",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "DAY",
        "intervals": ["2017-04-04/2017-04-05"]
      },
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "flattenSpec": {
            "useFieldDiscovery": true,
            "fields": [
              { "type": "path", "name": "timestamp", "expr": "$.eventHeader.createdOn.unixTimeMillis" },
              { "type": "path", "name": "id", "expr": "$.downstreamId.identifier" },
              { "type": "path", "name": "type", "expr": "$.type" },
              { "type": "path", "name": "exceptionType", "expr": "$.exception.exceptionType" },
              { "type": "path", "name": "qCount", "expr": "$.qCount" },
              { "type": "path", "name": "region", "expr": "$.eventHeader.serviceInstance.region" },
              { "type": "path", "name": "searchKind", "expr": "$.search.kind" },
              { "type": "path", "name": "engine", "expr": "$.engineName" },
              { "type": "path", "name": "requestClient", "expr": "$.requestClientKind" }
            ]
          },
          "dimensionsSpec": {
            "dimensions": [
              "id",
              "type",
              "exceptionType",
              "region",
              "searchKind",
              "engine",
              "requestClient"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          },
          "timestampSpec": {
            "format": "millis",
            "column": "timestamp"
          }
        }
      },
      "metricsSpec": [
        { "name": "count", "type": "count" },
        { "type": "longSum", "name": "qCountSum", "fieldName": "qCount" }
      ]
    },
    "tuningConfig": {
      "type": "hadoop",
      "jobProperties": {
        "fs.s3n.awsAccessKeyId": "<ACCESS_KEY>",
        "fs.s3n.awsSecretAccessKey": "<SECRET_KEY>",
        "fs.s3n.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem"
      }
    }
  }
}
```
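
For completeness, the task is submitted to the overlord in the usual way, with the spec above saved to a local file (the file name is arbitrary, and this assumes the default overlord port):

```
curl -X POST -H 'Content-Type: application/json' \
  -d @hadoop_index_task.json \
  http://localhost:8090/druid/indexer/v1/task
```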

Hopefully someone out there has an idea of what might be causing this.

Thanks!

Yong Cheng