Hadoop ingestion error: java.lang.OutOfMemoryError: GC overhead limit exceeded

Hi, my Hadoop ingestion tasks keep failing with the following error. Does anyone know why this happens? Thanks.



2018-11-10T05:10:38,223 INFO [LocalJobRunner Map Task Executor #0] io.druid.indexer.HadoopDruidIndexerConfig - Running with config:
{
  "spec" : {
    "dataSchema" : {
      "dataSource" : "abandonments",
      "parser" : {
        "type" : "hadoopyString",
        "parseSpec" : {
          "format" : "csv",
          "columns" : [ "date_id", "client_id", "second_of_day", "session_id", "ad_visitor_id", "ip_address", "referral_url", "referral_type", "referral_name", "natural_search_term", "browser", "os", "destination_url", "country", "state", "city", "language", "time_zone", "screen_resolution", "color_depth", "javascript_version", "user_agent", "mobile_device", "mobile_network", "connection_type", "device_vendor", "device_model", "device_marketing_name", "device_type", "order_number", "product_id", "product_name", "site_location_id", "product_quantity", "backorder_quantity", "base_price", "site_id" ],
          "dimensionsSpec" : {
            "dimensions" : [ "client_id", "second_of_day", "session_id", "ad_visitor_id", "ip_address", "referral_url", "referral_type", "referral_name", "natural_search_term", "browser", "os", "destination_url", "country", "state", "city", "language", "time_zone", "screen_resolution", "color_depth", "javascript_version", "user_agent", "mobile_device", "mobile_network", "connection_type", "device_vendor", "device_model", "device_marketing_name", "device_type", "order_number", "product_id", "product_name", "site_location_id", "site_id" ]
          },
          "timestampSpec" : {
            "format" : "yyyy-MM-dd'T'HH:mm:ss'Z'",
            "column" : "date_id"
          }
        }
      },
      "metricsSpec" : [ {
        "type" : "longSum",
        "name" : "abandonments_product_quantity",
        "fieldName" : "product_quantity",
        "expression" : null
      }, {
        "type" : "longSum",
        "name" : "abandonments_backorder_quantity",
        "fieldName" : "backorder_quantity",
        "expression" : null
      }, {
        "type" : "floatSum",
        "name" : "abandonments_base_price_total",
        "fieldName" : "base_price",
        "expression" : null
      }, {
        "type" : "count",
        "name" : "session_id_count"
      } ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "HOUR",
        "queryGranularity" : {
          "type" : "none"
        },
        "rollup" : true,
        "intervals" : [ "2008-09-12T00:00:00.000Z/2020-12-31T00:00:00.000Z" ]
      },
      "transformSpec" : {
        "filter" : null,
        "transforms" : [ ]
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "hdfs://dev02-ev0100.cm.emm.local:54010/druid_data/abandonments.txt"
      },
      "metadataUpdateSpec" : null,
      "segmentOutputPath" : "hdfs://dev02-ev0100.cm.emm.local:54010/druid/segments"
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "workingPath" : "var/druid/hadoop-tmp",
      "version" : "2018-11-10T05:10:23.557Z",
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 5000000,
        "maxPartitionSize" : 7500000,
        "assumeGrouped" : false,
        "numShards" : -1,
        "partitionDimensions" : [ ]
      },
      "shardSpecs" : { },
      "indexSpec" : {
        "bitmap" : {
          "type" : "concise"
        },
        "dimensionCompression" : "lz4",
        "metricCompression" : "lz4",
        "longEncoding" : "longs"
      },
      "maxRowsInMemory" : 75000,
      "leaveIntermediate" : false,
      "cleanupOnFailure" : true,
      "overwriteFiles" : false,
      "ignoreInvalidRows" : false,
      "jobProperties" : { },
      "combineText" : false,
      "useCombiner" : false,
      "buildV9Directly" : true,
      "numBackgroundPersistThreads" : 0,
      "forceExtendableShardSpecs" : false,
      "useExplicitVersion" : false,
      "allowedHadoopPrefix" : [ ]
    },
    "uniqueId" : "af69fd25383946baa0c21378b688dd7e"
  }
}
2018-11-10T05:10:38,325 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - Job job_local1983385492_0001 running in uber mode : false
2018-11-10T05:10:38,326 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job -  map 0% reduce 0%
2018-11-10T05:10:38,443 INFO [LocalJobRunner Map Task Executor #0] org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2018-11-10T05:10:38,696 INFO [LocalJobRunner Map Task Executor #0] org.apache.hadoop.mapred.LocalJobRunner -
2018-11-10T05:10:38,701 INFO [LocalJobRunner Map Task Executor #0] org.apache.hadoop.mapred.MapTask - Starting flush of map output
2018-11-10T05:10:38,701 INFO [LocalJobRunner Map Task Executor #0] org.apache.hadoop.mapred.MapTask - Spilling map output
2018-11-10T05:10:38,701 INFO [LocalJobRunner Map Task Executor #0] org.apache.hadoop.mapred.MapTask - bufstart = 0; bufend = 2048811; bufvoid = 104857600
2018-11-10T05:10:38,701 INFO [LocalJobRunner Map Task Executor #0] org.apache.hadoop.mapred.MapTask - kvstart = 26214396(104857584); kvend = 25783072(103132288); length = 431325/6553600
2018-11-10T05:10:42,740 INFO [LocalJobRunner Map Task Executor #0] org.apache.hadoop.mapred.MapTask - Finished spill 0
2018-11-10T05:10:42,785 INFO [LocalJobRunner Map Task Executor #0] org.apache.hadoop.mapred.Task - Task:attempt_local1983385492_0001_m_000000_0 is done. And is in the process of committing
2018-11-10T05:10:42,799 INFO [LocalJobRunner Map Task Executor #0] org.apache.hadoop.mapred.LocalJobRunner - map
2018-11-10T05:10:42,799 INFO [LocalJobRunner Map Task Executor #0] org.apache.hadoop.mapred.Task - Task 'attempt_local1983385492_0001_m_000000_0' done.
2018-11-10T05:10:42,799 INFO [LocalJobRunner Map Task Executor #0] org.apache.hadoop.mapred.LocalJobRunner - Finishing task: attempt_local1983385492_0001_m_000000_0
2018-11-10T05:10:42,799 INFO [Thread-57] org.apache.hadoop.mapred.LocalJobRunner - map task executor complete.
2018-11-10T05:10:43,331 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job -  map 100% reduce 0%
2018-11-10T05:12:05,602 WARN [main-SendThread(10.111.232.114:2181)] org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 27888ms for sessionid 0x166fb9c2fc9000c
2018-11-10T05:12:38,758 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - Job job_local1983385492_0001 failed with state FAILED due to: NA
2018-11-10T05:12:38,756 WARN [Thread-57] org.apache.hadoop.mapred.LocalJobRunner - job_local1983385492_0001
java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.util.concurrent.ConcurrentHashMap.putVal(ConcurrentHashMap.java:1019) ~[?:1.8.0_144]
	at java.util.concurrent.ConcurrentHashMap.putAll(ConcurrentHashMap.java:1084) ~[?:1.8.0_144]
	at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:852) ~[?:1.8.0_144]
	at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:713) ~[hadoop-common-2.7.3.jar:?]
	at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:442) ~[hadoop-mapreduce-client-core-2.7.3.jar:?]
	at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.<init>(LocalJobRunner.java:294) ~[hadoop-mapreduce-client-common-2.7.3.jar:?]
	at org.apache.hadoop.mapred.LocalJobRunner$Job.getReduceTaskRunnables(LocalJobRunner.java:350) ~[hadoop-mapreduce-client-common-2.7.3.jar:?]
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:526) [hadoop-mapreduce-client-common-2.7.3.jar:?]
2018-11-10T05:12:10,254 WARN [main-SendThread(10.111.232.114:2181)] org.apache.zookeeper.server.ZooKeeperThread - Exception occurred from thread main-SendThread(10.111.232.114:2181)
java.lang.OutOfMemoryError: GC overhead limit exceeded
	at org.apache.logging.log4j.core.util.datetime.FixedDateFormat.format(FixedDateFormat.java:270) ~[log4j-core-2.5.jar:2.5]
	at org.apache.logging.log4j.core.pattern.DatePatternConverter$FixedFormatter.format(DatePatternConverter.java:72) ~[log4j-core-2.5.jar:2.5]
	at org.apache.logging.log4j.core.pattern.DatePatternConverter$CachedTime.<init>(DatePatternConverter.java:103) ~[log4j-core-2.5.jar:2.5]
	at org.apache.logging.log4j.core.pattern.DatePatternConverter.format(DatePatternConverter.java:207) ~[log4j-core-2.5.jar:2.5]
	at org.apache.logging.log4j.core.pattern.DatePatternConverter.format(DatePatternConverter.java:201) ~[log4j-core-2.5.jar:2.5]
	at org.apache.logging.log4j.core.pattern.PatternFormatter.format(PatternFormatter.java:36) ~[log4j-core-2.5.jar:2.5]
	at org.apache.logging.log4j.core.layout.PatternLayout$PatternSerializer.toSerializable(PatternLayout.java:292) ~[log4j-core-2.5.jar:2.5]
	at org.apache.logging.log4j.core.layout.PatternLayout.toSerializable(PatternLayout.java:206) ~[log4j-core-2.5.jar:2.5]
	at org.apache.logging.log4j.core.layout.PatternLayout.toSerializable(PatternLayout.java:56) ~[log4j-core-2.5.jar:2.5]
	at org.apache.logging.log4j.core.layout.AbstractStringLayout.toByteArray(AbstractStringLayout.java:148) ~[log4j-core-2.5.jar:2.5]
	at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.append(AbstractOutputStreamAppender.java:112) ~[log4j-core-2.5.jar:2.5]
	at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:152) ~[log4j-core-2.5.jar:2.5]
	at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:125) ~[log4j-core-2.5.jar:2.5]
	at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:116) ~[log4j-core-2.5.jar:2.5]
	at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84) ~[log4j-core-2.5.jar:2.5]
	at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:390) ~[log4j-core-2.5.jar:2.5]
	at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:378) ~[log4j-core-2.5.jar:2.5]
	at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:362) ~[log4j-core-2.5.jar:2.5]
	at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:352) ~[log4j-core-2.5.jar:2.5]
	at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:63) ~[log4j-core-2.5.jar:2.5]
	at org.apache.logging.log4j.core.Logger.logMessage(Logger.java:147) ~[log4j-core-2.5.jar:2.5]
	at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:1016) ~[log4j-api-2.5.jar:2.5]
	at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:964) ~[log4j-api-2.5.jar:2.5]
	at org.apache.logging.slf4j.Log4jLogger.info(Log4jLogger.java:178) ~[log4j-slf4j-impl-2.5.jar:2.5]
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1156) ~[zookeeper-3.4.10.jar:3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f]

Followed by:

2018-11-10T06:09:39,310 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[HadoopIndexTask{id=index_hadoop_abandonments_2018-11-10T06:07:08.620Z, type=index_hadoop, dataSource=abandonments}]
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
	at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.1.jar:?]
	at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:222) ~[druid-indexing-service-0.12.0.jar:0.12.0]
	at io.druid.indexing.common.task.HadoopIndexTask.run(HadoopIndexTask.java:184) ~[druid-indexing-service-0.12.0.jar:0.12.0]
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:444) [druid-indexing-service-0.12.0.jar:0.12.0]
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:416) [druid-indexing-service-0.12.0.jar:0.12.0]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_144]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_144]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_144]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_144]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_144]
	at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:219) ~[druid-indexing-service-0.12.0.jar:0.12.0]
	... 7 more
Caused by: io.druid.java.util.common.ISE: Job[class io.druid.indexer.DetermineHashedPartitionsJob] failed!
	at io.druid.indexer.JobHelper.runJobs(JobHelper.java:390) ~[druid-indexing-hadoop-0.12.0.jar:0.12.0]
	at io.druid.indexer.HadoopDruidDetermineConfigurationJob.run(HadoopDruidDetermineConfigurationJob.java:91) ~[druid-indexing-hadoop-0.12.0.jar:0.12.0]
	at io.druid.indexing.common.task.HadoopIndexTask$HadoopDetermineConfigInnerProcessing.runTask(HadoopIndexTask.java:325) ~[druid-indexing-service-0.12.0.jar:0.12.0]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_144]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_144]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_144]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_144]
	at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:219) ~[druid-indexing-service-0.12.0.jar:0.12.0]
	... 7 more
2018-11-10T06:09:39,320 INFO [task-runner-0-priority-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_hadoop_abandonments_2018-11-10T06:07:08.620Z] status changed to [FAILED].
2018-11-10T06:09:39,324 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_hadoop_abandonments_2018-11-10T06:07:08.620Z",
  "status" : "FAILED",
  "duration" : 142425
}

This indicates that the worker task probably needs more heap:

2018-11-10T05:12:38,756 WARN [Thread-57] org.apache.hadoop.mapred.LocalJobRunner - job_local1983385492_0001
java.lang.OutOfMemoryError: GC overhead limit exceeded
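
A minimal sketch of raising the worker task (peon) heap, assuming the stock middleManager runtime.properties and that -Xmx4g fits your hardware (adjust the value to your data):

# middleManager runtime.properties -- druid.indexer.runner.javaOpts is passed to each peon (worker task) JVM
druid.indexer.runner.javaOpts=-server -Xmx4g -Duser.timezone=UTC -Dfile.encoding=UTF-8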

However, the task is using LocalJobRunner, i.e. it is running the MapReduce job inside Druid's single worker task process instead of submitting it to your Hadoop cluster (I assume you have one, since the input is read from HDFS). If that's not intentional, I would double-check that you've put your Hadoop cluster's *-site.xml config files on the Druid classpath.
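
A rough sketch of what that looks like (hypothetical paths; assumes the stock conf/druid/_common directory, which is on the classpath of every Druid process by default):

# copy the cluster's client configs next to Druid's common config -- adjust paths to your install
cp /etc/hadoop/conf/core-site.xml   $DRUID_HOME/conf/druid/_common/
cp /etc/hadoop/conf/hdfs-site.xml   $DRUID_HOME/conf/druid/_common/
cp /etc/hadoop/conf/yarn-site.xml   $DRUID_HOME/conf/druid/_common/
cp /etc/hadoop/conf/mapred-site.xml $DRUID_HOME/conf/druid/_common/
# then restart the middleManager so new tasks pick them up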

I would also suggest using a coarser segment granularity than HOUR, since you're ingesting a 12-year interval. Splitting the job into multiple tasks over smaller intervals can also help. Your current spec is below, with a possible adjustment sketched after it:

 "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "HOUR",
        "queryGranularity" : {
          "type" : "none"
        },
        "rollup" : true,
        "intervals" : [ "2008-09-12T00:00:00.000Z/2020-12-31T00:00:00.000Z" ]
      },
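
For example, a sketch of a coarser spec (DAY is just a guess, and the narrower interval assumes you would submit one task per year or so; pick values that match your actual data volume):

 "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : { "type" : "none" },
        "rollup" : true,
        "intervals" : [ "2008-09-12T00:00:00.000Z/2009-01-01T00:00:00.000Z" ]
      },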