Druid broker java.lang.OutOfMemoryError: Java heap space

We are seeing a Java heap space problem on our broker nodes. Before concluding that we simply need more memory, I thought I'd ask here first; maybe the problem is somewhere else.

We’re running Druid 0.9.0

Here’s the exception I get

```
	at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52) [jetty-server-9.2.5.v20141112.jar:9.2.5.v20141112]
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) [jetty-server-9.2.5.v20141112.jar:9.2.5.v20141112]
	at org.eclipse.jetty.server.Server.handle(Server.java:497) [jetty-server-9.2.5.v20141112.jar:9.2.5.v20141112]
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310) [jetty-server-9.2.5.v20141112.jar:9.2.5.v20141112]
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248) [jetty-server-9.2.5.v20141112.jar:9.2.5.v20141112]
	at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540) [jetty-io-9.2.5.v20141112.jar:9.2.5.v20141112]
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:620) [jetty-util-9.2.5.v20141112.jar:9.2.5.v20141112]
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:540) [jetty-util-9.2.5.v20141112.jar:9.2.5.v20141112]
	at java.lang.Thread.run(Thread.java:745) [?:1.7.0_131]
Caused by: java.lang.OutOfMemoryError: Java heap space
```

Also, I think it's because of that exception that we get this one as well:

```
org.jboss.netty.channel.ConnectTimeoutException: connection timed out: /10.xx.xx.xx:8101
	at org.jboss.netty.channel.socket.nio.NioClientBoss.processConnectTimeout(NioClientBoss.java:139) [netty-3.10.4.Final.jar:?]
	at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:83) [netty-3.10.4.Final.jar:?]
	at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337) [netty-3.10.4.Final.jar:?]
	at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42) [netty-3.10.4.Final.jar:?]
	at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [netty-3.10.4.Final.jar:?]
	at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [netty-3.10.4.Final.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_141]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_141]
	at java.lang.Thread.run(Thread.java:748) [?:1.7.0_141]
```

10.xx.xx.xx is our index node.

Here are our JVM opts:

```
java -server -Xmx20g -Xms20g -XX:MaxDirectMemorySize=22g -XX:+UseG1GC \
  -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/tmp \
  -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager \
  -Dcom.sun.management.jmxremote.port=17071 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  -classpath .:/usr/local/lib/druid/lib/* io.druid.cli.Main server broker
```

We're running the broker nodes on c4.4xlarge instances. We have four of them behind an ELB to serve queries.

Here's our runtime.properties:

```
# This file is managed by Puppet
# MODIFICATION WILL BE OVERWRITTEN

# Node Configs
druid.host=10.xx.xx.xx
druid.port=8082
druid.service=druid/broker

# Query Configs
druid.broker.balancer.type=connectionCount
druid.broker.select.tier=highestPriority
druid.server.http.numThreads=200
druid.server.http.maxIdleTime=PT5m
druid.broker.http.numConnections=100
druid.broker.http.readTimeout=PT15M
druid.broker.retryPolicy.numTries=1
druid.processing.buffer.sizeBytes=1073741824
druid.processing.formatString=processing-%s
druid.processing.numThreads=15
druid.processing.columnCache.sizeBytes=0
druid.query.groupBy.singleThreaded=false
druid.query.groupBy.maxIntermediateRows=50000
druid.query.groupBy.maxResults=700000
druid.query.search.maxSearchLimit=1000

# Caching
druid.broker.cache.useCache=true
druid.broker.cache.populateCache=true
druid.cache.type=memcached
druid.broker.cache.unCacheable=["groupBy","select"]
druid.cache.expiration=86400
druid.cache.timeout=500
druid.cache.maxObjectSize=52428800
druid.cache.memcachedPrefix=druid
```
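As a side note, the off-heap side of this config can be sanity-checked with a quick sketch, assuming the 0.9.x-era rule of thumb that a node needs roughly `druid.processing.buffer.sizeBytes * (druid.processing.numThreads + 1)` bytes of direct memory for its processing buffers:

```python
# Back-of-the-envelope check of the broker's direct-memory budget
# against the settings above. Assumption: each processing thread plus
# one extra buffer each needs druid.processing.buffer.sizeBytes.
buffer_size_bytes = 1073741824   # druid.processing.buffer.sizeBytes (1 GiB)
num_threads = 15                 # druid.processing.numThreads
max_direct_gib = 22              # -XX:MaxDirectMemorySize=22g

needed_bytes = buffer_size_bytes * (num_threads + 1)
needed_gib = needed_bytes / 2**30
print(f"direct memory needed: {needed_gib:.0f} GiB of {max_direct_gib} GiB")
# → direct memory needed: 16 GiB of 22 GiB
```

So the 22 GiB of direct memory comfortably covers the processing buffers; a "Java heap space" OOME is on-heap, which points at result merging or caching rather than the buffers themselves.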

We have about 40 million queries a day to the nodes. Almost all of them are TopN.

We use ElastiCache running on the cache.r3.xlarge instance type.

Hey Noppanit,

I think the best way to figure out if you really need more memory or not is to turn on -XX:+HeapDumpOnOutOfMemoryError and then see what objects are taking up the heap space once you get an OOME. If they seem “reasonable” then you need more memory. If they don’t then you should probably adjust configs or queries instead.
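If it helps, a minimal sketch of the extra flags (the dump path is just an example location; a dump from a 20 GB heap needs that much free disk):

```
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/broker-oom.hprof
```

The resulting .hprof file can be opened in a heap analyzer such as Eclipse MAT, or you can get a quick class histogram from a live process with `jmap -histo <pid>`.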

Thanks for the response. If it turns out we didn’t need more memory, what adjustments can we do for the config or the queries? Are you talking about the way we query data or if there’s Druid config that I’m missing?

I mean both. Some queries can really hog memory (like select queries, and topNs with very large thresholds), and some Druid configs can use a lot of memory too (big caches, query-time lookups, etc.).
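To make that concrete, here's an illustrative topN with an oversized threshold (the datasource, dimension, and interval are made up). The broker has to merge the per-segment top lists for all of those rows in heap, so a huge threshold behaves much more like a groupBy than a cheap topN:

```
{
  "queryType": "topN",
  "dataSource": "events",
  "dimension": "userId",
  "metric": "count",
  "threshold": 500000,
  "granularity": "all",
  "aggregations": [{ "type": "count", "name": "count" }],
  "intervals": ["2016-01-01/2016-02-01"]
}
```

Keeping thresholds small (hundreds rather than hundreds of thousands) keeps the per-query merge work on the broker bounded.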