java.lang.OutOfMemoryError: Java heap space

When I apply the following config changes to the Druid Historical service, I expect to be able to ingest 60 million rows, but instead the task throws java.lang.OutOfMemoryError: Java heap space.

  • -XX:MaxDirectMemorySize=10240g
  • druid.processing.buffer.sizeBytes=500MiB
  • druid.processing.numMergeBuffers=8
Things I've tried
  • Applied the same configuration changes to the Druid Broker service; the task then stays in the PENDING state indefinitely
Logs
2022-08-10T12:46:13,651 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.common.task.IndexTask - Encountered exception in DETERMINE_PARTITIONS.
java.lang.RuntimeException: java.io.IOException: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
    at org.apache.druid.data.input.impl.InputEntityIteratingReader.lambda$read$0(InputEntityIteratingReader.java:81) ~[druid-core-0.22.1.jar:0.22.1]
    at org.apache.druid.java.util.common.parsers.CloseableIterator$2.findNextIteratorIfNecessary(CloseableIterator.java:84) ~[druid-core-0.22.1.jar:0.22.1]
    at org.apache.druid.java.util.common.parsers.CloseableIterator$2.(CloseableIterator.java:69) ~[druid-core-0.22.1.jar:0.22.1]
    at org.apache.druid.java.util.common.parsers.CloseableIterator.flatMap(CloseableIterator.java:67) ~[druid-core-0.22.1.jar:0.22.1]
    at org.apache.druid.data.input.impl.InputEntityIteratingReader.createIterator(InputEntityIteratingReader.java:103) ~[druid-core-0.22.1.jar:0.22.1]
    at org.apache.druid.data.input.impl.InputEntityIteratingReader.read(InputEntityIteratingReader.java:74) ~[druid-core-0.22.1.jar:0.22.1]
    at org.apache.druid.segment.transform.TransformingInputSourceReader.read(TransformingInputSourceReader.java:43) ~[druid-processing-0.22.1.jar:0.22.1]

Seeking expert help here, as this issue has become a blocker. Note that we are able to ingest 20 million rows with the above configuration without any problems.

What additional configuration is needed to process 60 million rows?

Thanks,
Keerthi Kumar N

Hi Keerthi,

Historicals are not involved during ingestion. Have you reviewed the configuration for the Middle Managers?
This is a good guide:

An often missed configuration step is the druid.indexer.runner.javaOptsArray parameter that configures the JVM for the MM workers (peons) which are separate from the Middle Manager itself. These are the actual ingestion JVMs, so tuning their parameters tends to be critical:
Configuration reference · Apache Druid
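
As a rough illustration of that parameter, a Middle Manager runtime.properties fragment might look like the sketch below. The heap, direct memory, and capacity values are assumptions chosen for illustration only, not recommendations for this cluster:

```properties
# Middle Manager runtime.properties (illustrative values, not a recommendation)

# JVM options passed to every peon (task) process, as a JSON array of strings.
druid.indexer.runner.javaOptsArray=["-server","-Xms3g","-Xmx3g","-XX:MaxDirectMemorySize=3g","-XX:+ExitOnOutOfMemoryError","-Duser.timezone=UTC","-Dfile.encoding=UTF-8"]

# Maximum number of peons this Middle Manager runs at once (assumed value).
# Each peon gets its own heap and direct memory, so the host needs roughly
# capacity * (peon heap + peon direct memory) on top of the Middle Manager JVM.
druid.worker.capacity=2
```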

Hello @Sergio_Ferragut ,

Sorry for the late reply. I went through the link you shared regarding configuration. However, I am now facing the exception below, which eventually leads to the OutOfMemory issue.

2022-09-05T10:18:08,766 INFO [main-EventThread] org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
2022-09-05T10:18:08,769 WARN [NodeRoleWatcher[COORDINATOR]] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeRoleWatcher - Ignored event type[CONNECTION_SUSPENDED] for node watcher of role[coordinator].
2022-09-05T10:18:08,770 WARN [NodeRoleWatcher[OVERLORD]] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeRoleWatcher - Ignored event type[CONNECTION_SUSPENDED] for node watcher of role[overlord].

Could you please help me understand what is going wrong here? I am struggling to understand how to configure Druid to optimize the data ingestion process. Below are the config details:

BROKER
jvm.config: |-
-server
-XX:MaxDirectMemorySize=10240g
-XX:+ExitOnOutOfMemoryError
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-Dlog4j.debug
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
-Djava.io.tmpdir=/druid/data
-Xmx8g
-Xms8g

druid.broker.http.numConnections=20
druid.server.http.numThreads=50

COORDINATOR
jvm.config: |-
-server
-XX:MaxDirectMemorySize=10240g
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-Dlog4j.debug
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
-Djava.io.tmpdir=/druid/data
-Xmx10g
-Xms10g
-XX:NewSize=512m
-XX:NewSize=512m
-XX:+UseG1GC
-XX:+ExitOnOutOfMemoryError
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintGCDateStamps

Task launch parameters
druid.indexer.runner.javaOpts=-server 
-Xmx3g 
-XX:MaxDirectMemorySize=4096m 
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
-XX:+PrintGCDateStamps 
-Duser.timezone=UTC 
-Dfile.encoding=UTF-8

HISTORICALS

jvm.config: |-
-server
-XX:MaxDirectMemorySize=10240g
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-Dlog4j.debug
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
-Djava.io.tmpdir=/druid/data
-Xmx12g
-Xms12g
-XX:NewSize=6g
-XX:NewSize=6g
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintGCDateStamps

druid.segmentCache.locations=[{"path":"/druid/data/segments","maxSize":130000000000}]
druid.server.maxSize=130000000000

I request everyone in this group to help me out with this issue.

Thanks,
Keerthi Kumar N

Hello Team… any help here please???

Is this a typo in the configs?

-XX:MaxDirectMemorySize=10240g

Do you have terabytes of RAM? Usually MaxDirectMemorySize is set to something less than Xmx, say 1/2 of it.

After that, if you’re getting “out of heap space”, you can try increasing the heap that is involved. (Which log is showing the OOM message?)
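
For sizing, Druid's cluster tuning guidance derives the direct memory a data-serving process needs from its processing buffer settings. A rough sketch using the values from the original post; druid.processing.numThreads=7 is an assumption, since the thread never states it:

```properties
# Direct memory needed by the processing buffers:
#   druid.processing.buffer.sizeBytes * (numMergeBuffers + numThreads + 1)
# With the posted values and an ASSUMED numThreads of 7:
#   500 MiB * (8 + 7 + 1) = 8000 MiB
druid.processing.buffer.sizeBytes=500MiB
druid.processing.numMergeBuffers=8
# Assumed value for illustration; not given anywhere in the thread.
druid.processing.numThreads=7

# The matching JVM flag in jvm.config would then need to be at least that large,
# for example -XX:MaxDirectMemorySize=8g (8192 MiB >= 8000 MiB),
# and the host must actually have that much RAM free beyond the heap.
```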

Hello @Ben_Krug,

Thanks for the response. Yes, this is what is set in all the Druid services: coordinator, broker, historicals and routers. The OOM was shown in the coordinator.

Any leads here pls??

Thanks,
Keerthi Kumar N

What happens if you set MaxDirectMemorySize to 1/2 the heap size for each component? Broker 4G, coordinator 5G, tasks 1G, historical 3G?

Hello @Ben_Krug,

I do not understand what you are suggesting here. In our Druid environment, MaxDirectMemorySize is set to 10240g for all the Druid services (to be honest, I am not sure why). Based on your suggestion, what value should this attribute be changed to and validated with? Kindly confirm.

Thanks,
Keerthi Kumar N

@Ben_Krug,

As you suggested above, I applied the changes, and below is the result:

Caused by: java.lang.OutOfMemoryError: Java heap space
2022-09-07T06:48:47,296 WARN [task-runner-0-priority-0] org.apache.druid.segment.realtime.firehose.ServiceAnnouncingChatHandlerProvider - handler[index_parallel_test_kk_60mn_heofnidc_2022-09-07T05:59:10.938Z] not currently registered, ignoring.
2022-09-07T06:48:47,298 INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_parallel_test_kk_60mn_heofnidc_2022-09-07T05:59:10.938Z",
  "status" : "FAILED",
  "duration" : 2970184,
  "errorMsg" : "java.lang.RuntimeException: java.io.IOException: java.lang.RuntimeException: java.lang.RuntimeExcept…",
  "location" : {
    "host" : null,
    "port" : -1,
    "tlsPort" : -1
  }
}
2022-09-07T06:48:47,304 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [ANNOUNCEMENTS]
2022-09-07T06:48:47,306 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [SERVER]
2022-09-07T06:48:47,309 WARN [main-SendThread(10.233.96.110:2181)] org.apache.zookeeper.ClientCnxn - Unable to reconnect to ZooKeeper service, session 0x10492a1ff50000b has expired
2022-09-07T06:48:47,313 INFO [main] org.eclipse.jetty.server.AbstractConnector - Stopped ServerConnector@296a71df{HTTP/1.1, (http/1.1)}{0.0.0.0:8100}
2022-09-07T06:48:47,313 INFO [main] org.eclipse.jetty.server.session - node0 Stopped scavenging
2022-09-07T06:48:47,315 INFO [main] org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.s.ServletContextHandler@6f740044{/,null,STOPPED}
2022-09-07T06:48:47,319 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [NORMAL]
2022-09-07T06:48:47,319 INFO [main] org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Starting graceful shutdown of task[index_parallel_test_kk_60mn_heofnidc_2022-09-07T05:59:10.938Z].
2022-09-07T06:48:47,410 INFO [main-EventThread] org.apache.zookeeper.ZooKeeper - Session: 0x10492a1ff50000b closed
2022-09-07T06:48:47,410 INFO [main-EventThread] org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=10.233.108.192:2181,10.233.96.110:2181,10.233.69.195:2181 sessionTimeout=30000 watcher=org.apache.curator.ConnectionState@7744195
2022-09-07T06:48:47,410 INFO [main-EventThread] org.apache.zookeeper.ClientCnxnSocket - jute.maxbuffer value is 4194304 Bytes
2022-09-07T06:48:47,411 INFO [main-EventThread] org.apache.zookeeper.ClientCnxn - zookeeper.request.timeout value is 0. feature enabled=
2022-09-07T06:48:47,411 INFO [main-EventThread] org.apache.curator.framework.state.ConnectionStateManager - State change: LOST
2022-09-07T06:48:47,411 INFO [main-EventThread] org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x10492a1ff50000b
2022-09-07T06:48:47,430 INFO [main-SendThread(10.233.69.195:2181)] org.apache.zookeeper.ClientCnxn - Opening socket connection to server 10-233-69-195.zookeeper-service.performance.svc.cluster.local/10.233.69.195:2181. Will not attempt to authenticate using SASL (unknown error)
2022-09-07T06:48:47,431 INFO [main-SendThread(10.233.69.195:2181)] org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /10.233.96.157:47842, server: 10-233-69-195.zookeeper-service.performance.svc.cluster.local/10.233.69.195:2181
2022-09-07T06:48:47,435 INFO [main-SendThread(10.233.69.195:2181)] org.apache.zookeeper.ClientCnxn - Session establishment complete on server 10-233-69-195.zookeeper-service.performance.svc.cluster.local/10.233.69.195:2181, sessionid = 0x30491d91365000a, negotiated timeout = 30000
2022-09-07T06:48:47,436 INFO [main-EventThread] org.apache.curator.framework.state.ConnectionStateManager - State change: RECONNECTED

It looks like the peons (task runners) are getting this? Can you increase heap size for peons (under druid.indexer.runner.javaOpts)?
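
As a sketch of that change, the task launch parameters posted earlier could be raised along these lines. Only -Xmx differs from the posted settings; 6g is an illustrative assumption, not a tested value, and the host needs enough RAM for every concurrently running peon at this size:

```properties
# Middle Manager runtime.properties
# Same options as posted earlier in the thread, with the peon heap raised
# from 3g to an ASSUMED 6g. The value is kept on a single line, since a
# properties value does not continue across lines without a trailing backslash.
druid.indexer.runner.javaOpts=-server -Xmx6g -XX:MaxDirectMemorySize=4096m -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Duser.timezone=UTC -Dfile.encoding=UTF-8
```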