java.lang.OutOfMemoryError: Java heap space

After making the following config changes in the Druid Historicals service, I expected to be able to ingest 60 million rows, but instead the ingestion throws java.lang.OutOfMemoryError: Java heap space.

  • -XX:MaxDirectMemorySize=10240g
  • druid.processing.buffer.sizeBytes=500MiB
  • druid.processing.numMergeBuffers=8
Things I've tried
  • Applied the same set of configuration changes to the Druid Broker service; the task then remains in the PENDING state indefinitely.
Logs

2022-08-10T12:46:13,651 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.common.task.IndexTask - Encountered exception in DETERMINE_PARTITIONS.
java.lang.RuntimeException: java.io.IOException: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
	at org.apache.druid.data.input.impl.InputEntityIteratingReader.lambda$read$0(InputEntityIteratingReader.java:81) ~[druid-core-0.22.1.jar:0.22.1]
	at org.apache.druid.java.util.common.parsers.CloseableIterator$2.findNextIteratorIfNecessary(CloseableIterator.java:84) ~[druid-core-0.22.1.jar:0.22.1]
	at org.apache.druid.java.util.common.parsers.CloseableIterator$2.<init>(CloseableIterator.java:69) ~[druid-core-0.22.1.jar:0.22.1]
	at org.apache.druid.java.util.common.parsers.CloseableIterator.flatMap(CloseableIterator.java:67) ~[druid-core-0.22.1.jar:0.22.1]
	at org.apache.druid.data.input.impl.InputEntityIteratingReader.createIterator(InputEntityIteratingReader.java:103) ~[druid-core-0.22.1.jar:0.22.1]
	at org.apache.druid.data.input.impl.InputEntityIteratingReader.read(InputEntityIteratingReader.java:74) ~[druid-core-0.22.1.jar:0.22.1]
	at org.apache.druid.segment.transform.TransformingInputSourceReader.read(TransformingInputSourceReader.java:43) ~[druid-processing-0.22.1.jar:0.22.1]

Seeking expertise here, as this issue has become a blocker. Kindly note that we are able to ingest 20 million rows with the above configuration without any hurdles.

What additional configuration needs to be included to be able to process 60 million rows of data?

Thanks,
Keerthi Kumar N

Hi Keerthi,

Historicals are not involved during ingestion; have you reviewed the configuration for the Middle Managers?
This is a good guide:

An often missed configuration step is the druid.indexer.runner.javaOptsArray parameter, which configures the JVMs for the MM workers (peons), which are separate from the Middle Manager itself. These are the actual ingestion JVMs, so tuning their parameters tends to be critical:
Configuration reference · Apache Druid
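
For example, the peon JVMs can be sized with something along these lines (the sizes here are illustrative only, not a recommendation for your cluster):

druid.indexer.runner.javaOptsArray=["-server","-Xms1g","-Xmx1g","-XX:MaxDirectMemorySize=1g","-Duser.timezone=UTC","-Dfile.encoding=UTF-8","-XX:+ExitOnOutOfMemoryError"]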

Hello @Sergio_Ferragut ,

Sorry for coming back late. I went through the link you shared regarding the configuration. However, I am now facing the exception below, which eventually leads to an OutOfMemory issue.

2022-09-05T10:18:08,766 INFO [main-EventThread] org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
2022-09-05T10:18:08,769 WARN [NodeRoleWatcher[COORDINATOR]] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeRoleWatcher - Ignored event type[CONNECTION_SUSPENDED] for node watcher of role[coordinator].
2022-09-05T10:18:08,770 WARN [NodeRoleWatcher[OVERLORD]] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeRoleWatcher - Ignored event type[CONNECTION_SUSPENDED] for node watcher of role[overlord].

Could you please help me understand what is going wrong here? I am baffled by the configuration needed to optimize the data ingestion process. Below are the config details:

BROKER
jvm.config: |-
-server
-XX:MaxDirectMemorySize=10240g
-XX:+ExitOnOutOfMemoryError
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-Dlog4j.debug
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
-Djava.io.tmpdir=/druid/data
-Xmx8g
-Xms8g

druid.broker.http.numConnections=20
druid.server.http.numThreads=50

COORDINATOR
jvm.config: |-
-server
-XX:MaxDirectMemorySize=10240g
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-Dlog4j.debug
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
-Djava.io.tmpdir=/druid/data
-Xmx10g
-Xms10g
-XX:NewSize=512m
-XX:MaxNewSize=512m
-XX:+UseG1GC
-XX:+ExitOnOutOfMemoryError
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintGCDateStamps

Task launch parameters
druid.indexer.runner.javaOpts=-server 
-Xmx3g 
-XX:MaxDirectMemorySize=4096m 
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
-XX:+PrintGCDateStamps 
-Duser.timezone=UTC 
-Dfile.encoding=UTF-8

HISTORICALS

jvm.config: |-
-server
-XX:MaxDirectMemorySize=10240g
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-Dlog4j.debug
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
-Djava.io.tmpdir=/druid/data
-Xmx12g
-Xms12g
-XX:NewSize=6g
-XX:MaxNewSize=6g
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintGCDateStamps

druid.segmentCache.locations=[{"path":"/druid/data/segments","maxSize":130000000000}]
druid.server.maxSize=130000000000

Requesting everyone in this group to help me out with this issue.

Thanks,
Keerthi Kumar N

Hello Team… any help here please???

Is this a typo in the configs?

-XX:MaxDirectMemorySize=10240g

Do you have terabytes of RAM? Usually MaxDirectMemorySize is set to something less than Xmx, say 1/2 of it.
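
For example, with the broker's current settings, that rule of thumb would look something like this (illustrative, not a tested recommendation):

-Xmx8g
-Xms8g
-XX:MaxDirectMemorySize=4g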

After that, if you’re getting “out of heap space”, you can try increasing the heap that is involved. (Which log is showing the OOM message?)

Hello @Ben_Krug,

Thanks for the response. Yes, this is what is set in all the Druid services: coordinator, broker, historicals & routers. It was the coordinator where the OOM was shown.

Any leads here pls??

Thanks,
Keerthi Kumar N

What happens if you set MaxDirectMemory to 1/2 the heap size for each component? Broker 4G, coordinator 5G, tasks 1G, historical 3G?

Hello @Ben_Krug,

I am not sure I understand what you are trying to say here. In our Druid environment, the attribute MaxDirectMemorySize is set to 10240g for all the Druid services (to be honest, I am not sure why). Per your statement above, to what value should this attribute be changed and validated? Kindly confirm.

Thanks,
Keerthi Kumar N

@Ben_Krug,

As you suggested in your response above, I applied the changes, and below is the result:

Caused by: java.lang.OutOfMemoryError: Java heap space
2022-09-07T06:48:47,296 WARN [task-runner-0-priority-0] org.apache.druid.segment.realtime.firehose.ServiceAnnouncingChatHandlerProvider - handler[index_parallel_test_kk_60mn_heofnidc_2022-09-07T05:59:10.938Z] not currently registered, ignoring.
2022-09-07T06:48:47,298 INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
"id" : "index_parallel_test_kk_60mn_heofnidc_2022-09-07T05:59:10.938Z",
"status" : "FAILED",
"duration" : 2970184,
"errorMsg" : "java.lang.RuntimeException: java.io.IOException: java.lang.RuntimeException: java.lang.RuntimeExcept…",
"location" : {
"host" : null,
"port" : -1,
"tlsPort" : -1
}
2022-09-07T06:48:47,304 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [ANNOUNCEMENTS]
2022-09-07T06:48:47,306 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [SERVER]
2022-09-07T06:48:47,309 WARN [main-SendThread(10.233.96.110:2181)] org.apache.zookeeper.ClientCnxn - Unable to reconnect to ZooKeeper service, session 0x10492a1ff50000b has expired
2022-09-07T06:48:47,313 INFO [main] org.eclipse.jetty.server.AbstractConnector - Stopped ServerConnector@296a71df{HTTP/1.1, (http/1.1)}{0.0.0.0:8100}
2022-09-07T06:48:47,313 INFO [main] org.eclipse.jetty.server.session - node0 Stopped scavenging
2022-09-07T06:48:47,315 INFO [main] org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.s.ServletContextHandler@6f740044{/,null,STOPPED}
2022-09-07T06:48:47,319 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [NORMAL]
2022-09-07T06:48:47,319 INFO [main] org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Starting graceful shutdown of task[index_parallel_test_kk_60mn_heofnidc_2022-09-07T05:59:10.938Z].
2022-09-07T06:48:47,410 INFO [main-EventThread] org.apache.zookeeper.ZooKeeper - Session: 0x10492a1ff50000b closed
2022-09-07T06:48:47,410 INFO [main-EventThread] org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=10.233.108.192:2181,10.233.96.110:2181,10.233.69.195:2181 sessionTimeout=30000 watcher=org.apache.curator.ConnectionState@7744195
2022-09-07T06:48:47,410 INFO [main-EventThread] org.apache.zookeeper.ClientCnxnSocket - jute.maxbuffer value is 4194304 Bytes
2022-09-07T06:48:47,411 INFO [main-EventThread] org.apache.zookeeper.ClientCnxn - zookeeper.request.timeout value is 0. feature enabled=
2022-09-07T06:48:47,411 INFO [main-EventThread] org.apache.curator.framework.state.ConnectionStateManager - State change: LOST
2022-09-07T06:48:47,411 INFO [main-EventThread] org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x10492a1ff50000b
2022-09-07T06:48:47,430 INFO [main-SendThread(10.233.69.195:2181)] org.apache.zookeeper.ClientCnxn - Opening socket connection to server 10-233-69-195.zookeeper-service.performance.svc.cluster.local/10.233.69.195:2181. Will not attempt to authenticate using SASL (unknown error)
2022-09-07T06:48:47,431 INFO [main-SendThread(10.233.69.195:2181)] org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /10.233.96.157:47842, server: 10-233-69-195.zookeeper-service.performance.svc.cluster.local/10.233.69.195:2181
2022-09-07T06:48:47,435 INFO [main-SendThread(10.233.69.195:2181)] org.apache.zookeeper.ClientCnxn - Session establishment complete on server 10-233-69-195.zookeeper-service.performance.svc.cluster.local/10.233.69.195:2181, sessionid = 0x30491d91365000a, negotiated timeout = 30000
2022-09-07T06:48:47,436 INFO [main-EventThread] org.apache.curator.framework.state.ConnectionStateManager - State change: RECONNECTED

It looks like the peons (task runners) are the ones hitting this. Can you increase the heap size for the peons (under druid.indexer.runner.javaOpts)?
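
For instance, keeping the other flags from the task launch parameters you posted and only raising the task heap (the 4g figure is illustrative, not a tuned value):

druid.indexer.runner.javaOpts=-server -Xms4g -Xmx4g -XX:MaxDirectMemorySize=4096m -Duser.timezone=UTC -Dfile.encoding=UTF-8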

This is a Java runtime error that occurs when your application keeps allocating new objects over time, the Garbage Collector (GC) cannot free enough space to accommodate a new object, and the heap cannot be expanded any further.

Therefore you pretty much have the following options:

  • Find the root cause of memory leaks with the help of profiling tools like MAT, VisualVM, jconsole, etc. Once you find the root cause, you can fix the leaks.
  • Optimize your code so that it needs less memory, using smaller data structures and releasing objects once they are no longer used in your program.
  • Increase the maximum memory your program is allowed to use with the -Xmx option (for instance, for 1024 MB: -Xmx1024m), as shown below. By default, the value depends on the JRE version and system configuration.
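
For example, when launching a standalone Java program from the command line (MyApp.jar is just a placeholder name):

java -Xms1024m -Xmx1024m -jar MyApp.jar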

Increasing the heap size alone can be a temporary fix, because you will hit the same issue again when you receive several parallel requests or try to process a bigger file. To avoid OutOfMemoryError, write high-performance code:

  • Use local variables wherever possible.
  • Release objects that you know will not be needed further.
  • Avoid creating new objects inside loops each time; see the sketch below.
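
As a minimal illustration of the last point, compare allocating inside a loop with reusing one instance (a generic Java sketch, unrelated to Druid internals):

public class LoopAllocationExample {
    public static void main(String[] args) {
        int n = 1_000_000;

        // Allocates a fresh StringBuilder on every iteration,
        // creating n short-lived objects for the GC to clean up.
        for (int i = 0; i < n; i++) {
            StringBuilder sb = new StringBuilder();
            sb.append("row-").append(i);
        }

        // Reuses a single instance; setLength(0) clears it cheaply,
        // so the loop produces far less garbage.
        StringBuilder reused = new StringBuilder();
        for (int i = 0; i < n; i++) {
            reused.setLength(0);
            reused.append("row-").append(i);
        }
    }
}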

Welcome @marktoddy!

Can we please get some clarification on this? The heap sizing discussion in this thread regards a Druid process, not the writing of Druid code.

As an example, here’s some language relevant to druid.indexer.runner.javaOptsArray:

The amount of direct memory needed by Druid is at least druid.processing.buffer.sizeBytes * (druid.processing.numMergeBuffers + druid.processing.numThreads + 1) . You can ensure at least this amount of direct memory is available by providing -XX:MaxDirectMemorySize=<VALUE> in druid.indexer.runner.javaOptsArray as documented above.
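
To make that concrete with the numbers posted earlier in this thread (druid.processing.buffer.sizeBytes=500MiB, druid.processing.numMergeBuffers=8), and assuming druid.processing.numThreads=2 purely for illustration:

500MiB × (8 + 2 + 1) = 5500MiB ≈ 5.4GiB

which is already more than the -XX:MaxDirectMemorySize=4096m set in the task launch parameters above, so the peons would need a larger direct memory ceiling (or smaller buffers / fewer merge buffers).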

I’ve linked to the Middle Manager config doc for further explanation.

To provide a bit more context, the Middle Manager launches task processes, which in turn perform ingestion work. This is an example of where things become use case specific, and, therefore, difficult to generalize:

The number of tasks a MiddleManager can launch is controlled by the druid.worker.capacity setting.

The number of workers needed in your cluster depends on how many concurrent ingestion tasks you need to run for your use cases. The number of workers that can be launched on a given machine depends on the size of resources allocated per worker and available system resources.

You can allocate more MiddleManager machines to your cluster to add task capacity.
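
As a rough sizing sketch (all numbers hypothetical): with druid.worker.capacity=4 and each task JVM launched with -Xmx3g and -XX:MaxDirectMemorySize=5g, one Middle Manager host needs on the order of 4 × (3g + 5g) = 32g for task JVMs alone, before counting the Middle Manager's own heap and the operating system.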

Use cases vary, and you can infer a wide variety of them by reading the Powered by Apache Druid page.

You can read a bit more about Java runtime for Druid here. I’ll highlight the following language for anyone who doesn’t want to follow the link:

Additionally, tasks run by MiddleManagers execute in separate JVMs. The command line for these JVMs is given by druid.indexer.runner.javaOptsArray or druid.indexer.runner.javaOpts in middleManager/runtime.properties. Java command line parameters for tasks must be specified here. For example, use a line like the following:

druid.indexer.runner.javaOptsArray=["-server","-Xms1g","-Xmx1g","-XX:MaxDirectMemorySize=1g","-Duser.timezone=UTC","-Dfile.encoding=UTF-8","-XX:+ExitOnOutOfMemoryError","-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager","--add-exports=java.base/jdk.internal.ref=ALL-UNNAMED","--add-exports=java.base/jdk.internal.misc=ALL-UNNAMED","--add-opens=java.base/java.lang=ALL-UNNAMED","--add-opens=java.base/java.io=ALL-UNNAMED","--add-opens=java.base/java.nio=ALL-UNNAMED","--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED","--add-opens=java.base/sun.nio.ch=ALL-UNNAMED"]

The Xms, Xmx, and MaxDirectMemorySize parameters in the line above are merely an example. You may use different values in your specific environment.

Druid is a community effort, and contributions are always welcome.