Druid Historical node potential memory leak

Hi,

I'm running a single Historical node on an m4.4xlarge machine (16 cores, 64 GB RAM). After running for 1-2 days it started throwing OOM errors. The GC logs suggested the young generation might be too small, so I raised it to 6 GB, as per the Druid production cluster configuration.

After a few more days, I'm seeing lots of GCs again. I ran jstat and got the following (even after triggering a manual GC with jcmd GC.run):

S0     S1     E      O      M      CCS    YGC   YGCT     FGC   FGCT      GCT
0.00   0.00   5.73   99.99  97.67  94.52  1495  48.873   103   100.036   148.908

The old generation usage (O) is at 99.99%. Only restarting the server fixes this, which is not something we can do every 1-2 days.
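For reference, these are roughly the commands I used to collect the stats above (<historical-pid> is just a placeholder for the Historical process id):

jstat -gcutil <historical-pid>
jcmd <historical-pid> GC.run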

This is the current jvm.config file:

-server
-Xms16g
-Xmx16g
-XX:NewSize=6g
-XX:MaxNewSize=6g
-XX:MaxDirectMemorySize=12g
-XX:+UseConcMarkSweepGC
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-Djava.io.tmpdir=var/tmp
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager

This is the runtime.properties file (the rest of the values are defaults):

druid.service=druid/historical
druid.port=8083

# HTTP server threads
druid.server.http.numThreads=20

# Processing threads and buffers
druid.processing.buffer.sizeBytes=536870912

# Segment storage
druid.segmentCache.locations=[{"path":"var/druid/segment-cache","maxSize":130000000000}]
druid.server.maxSize=130000000000

# Query cache
druid.historical.cache.useCache=true
druid.historical.cache.populateCache=true
druid.cache.type=local
druid.cache.sizeInBytes=2000000000

druid.monitoring.monitors=["com.metamx.metrics.JvmMonitor", "io.druid.server.metrics.HistoricalMetricsMonitor"]

These are the extensions being used:

druid.extensions.loadList=["druid-s3-extensions", "mysql-metadata-storage", "druid-histogram", "druid-datasketches", "druid-kafka-indexing-service", "druid-lookups-cached-global", "graphite-emitter"]

I'd appreciate any insight into where to look for the cause. Perhaps one of the extensions? We're currently running Imply 2.0.0 (Druid 0.9.2). The total size of data in deep storage is ~8 GB of segments, and free memory on the machine when the GCs occur is around 20 GB. I've had to keep raising the heap size.

Thanks in advance

Itamar

Can you analyze a heap dump and see what’s taking up all the space? My first guess would be something related to one of the extensions, like druid-lookups-cached-global or graphite-emitter.

Hi Gian,

Will do. These were my thoughts as well (since these are new issues). I will add the -XX:+HeapDumpOnOutOfMemoryError flag and wait for it to reoccur. Will keep this post updated. Thanks!
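For the record, this is roughly what I'm adding to jvm.config (the dump path is just an example, any writable directory should do):

-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=var/druid/heapdumps

In the meantime a dump can also be taken manually with something like this (the file name and <historical-pid> are placeholders):

jmap -dump:live,format=b,file=historical-heap.hprof <historical-pid>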

Itamar

Hi Itamar, did you find out what exactly the issue was? I am having a similar problem and am raising the heap size every few days now. I have 9 GB of data on the Historical node, and the heap it is currently using is 19 GB :frowning:

Regards,

Arpan Khagram