0.12.0 historical nodes old gen heap usage constantly increases

Hi all,

Ever since upgrading our cluster from 0.11 to 0.12, our historical nodes have been exhibiting memory-leak-like symptoms. Has anyone else had a similar issue?

Our historical nodes have the following jvm config (abbreviated after the memory-related params):

-server

-d64

-Xmx32g

-Xms32g

-XX:MaxDirectMemorySize=58368m

-XX:MaxNewSize=16384m

-XX:+UseCompressedOops

-XX:+UseG1GC

-XX:MaxGCPauseMillis=200

-XX:InitiatingHeapOccupancyPercent=30

-XX:G1HeapWastePercent=10

-XX:ParallelGCThreads=20

-XX:ConcGCThreads=5

When we run with a production load, everything seems fine at first and we don’t see any GC errors (even with verbosegc enabled). After running for around 10 hours, though, the old gen heap usage builds up to close to 32g, we start getting GC “allocation failure” messages, and queries stall until we restart our historical services.

Heap dumps show a lot of java.nio.DirectByteBufferR instances and io.druid.segment.data.GenericIndexed instances.

One other possible symptom is that we seem to be accumulating a lot of druid-related zookeeper watches.

Any comments or suggestions would be much appreciated.

All the best,

Stuart McLean

(From LiquidM Technology GmbH, Berlin)

Hi Stuart,

Can you tell what those DirectByteBufferR and GenericIndexeds are rolling up to? In a tool like YourKit you should be able to look at what they are being retained by. It might point to something not cleaning up resources properly in some case.

Hi Gian,

Thanks very much for your reply.

It turns out we were betrayed by a misnamed setting (and maybe a little bit by the documentation). It says here that the default cache is “local” but the default in the code is caffeine. Caffeine’s druid.cache.sizeInBytes and druid.cache.expireAfter default to “unlimited” and “no time limit” respectively, so we didn’t see the source of the memory usage until we took a heap dump of a process that had been running for several hours. At that point the source of the old-gen over-usage was obvious and we quickly found the misnamed parameter (which had unfortunately had a typo injected while fixing another memory-related issue).
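For anyone hitting the same thing, here is a minimal sketch of what explicitly bounded Caffeine settings might look like in a historical node's runtime.properties. The property names are the ones mentioned above; the values are illustrative assumptions only (not our production settings, and not recommendations), and the expireAfter value assumes the setting is expressed in milliseconds.

```properties
# Select the cache implementation explicitly instead of relying on the default.
druid.cache.type=caffeine

# Illustrative cap of 1 GiB; if this is unset (or misspelled), the cache is unbounded.
druid.cache.sizeInBytes=1073741824

# Illustrative expiry of one hour, assuming milliseconds; if unset, entries never expire by time.
druid.cache.expireAfter=3600000
```

It is worth double-checking the spelling of each key, since a mistyped cache property can leave the unlimited defaults in effect.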

Can we suggest that the documentation (and maybe the default params for caffeine) be updated? It might make sense to ensure caffeine doesn’t use more than a certain percentage of the allowed heap.

Thanks again and all the best,

Stuart

Hi Stuart,

Thanks for the great analysis. I agree that both of the changes you suggested would be good: fixing the docs, and changing the default size for Caffeine to be something other than unlimited (which is a silly default - unlimited is never good). Perhaps zero makes sense (i.e. no caching at all; meaning druid.cache.sizeInBytes becomes a mandatory parameter). Or perhaps min(1GB, Runtime.maxMemory / 10) makes sense.
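To make the second option concrete, here is a rough sketch of how a min(1GB, Runtime.maxMemory / 10) default could be computed. The class and method names are hypothetical and this is not Druid's actual code; it only illustrates the arithmetic, treating 1GB as 1 GiB.

```java
// Hypothetical helper illustrating the proposed default; not Druid's implementation.
final class CacheSizeDefaults {
    // 1 GiB expressed in bytes (assumption: "1GB" is read as 2^30 bytes).
    static final long ONE_GIB = 1024L * 1024L * 1024L;

    // Returns the smaller of 1 GiB and one tenth of the given max heap size.
    static long defaultCacheSizeBytes(long maxHeapBytes) {
        return Math.min(ONE_GIB, maxHeapBytes / 10);
    }

    public static void main(String[] args) {
        // Runtime.maxMemory() reports the maximum heap the JVM will attempt to use.
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.println("max heap bytes:      " + maxHeap);
        System.out.println("default cache bytes: " + defaultCacheSizeBytes(maxHeap));
    }
}
```

With a 32g heap like the one in this thread, a tenth of the heap is well over 1 GiB, so the cap would apply and the default would come out at 1 GiB. Smaller heaps would get a tenth of their size.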

Would you be willing to raise a PR for this?

Hi Gian,

First, here’s a pull for the documentation: https://github.com/druid-io/druid/pull/5737

I’ll try out an idea for the default cache size as well.

All the best,

Stuart

And here’s a pull for a default caffeine size - I liked your min(1GB, Runtime.maxMemory / 10) idea.