and coordination/historical disconnection. We're seeing long GC pauses on the historical nodes, which cause ZooKeeper disconnections (Client session timed out, have not heard from server).
How about enabling GC logging and looking at the big picture: allocation rate, possible spikes, memory leaks, etc.?
Something like -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+UseGCLogFileRotation
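For example, in the historical's jvm.config (or wherever you set the JVM args) it could look roughly like this; just a sketch for Java 8-style GC logging, with the log path, file count, and sizes as placeholders (on Java 9+ a single -Xlog:gc* flag replaces these):

-Xloggc:/var/log/druid/historical-gc.log
-XX:+PrintGC
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+PrintTenuringDistribution
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=50M

Note that -XX:+UseGCLogFileRotation only takes effect when -Xloggc points the log at a file.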
You can use a tool like https://gceasy.io/ or jClarity to analyze the full GC log.
The JMC TLAB Allocations view shows a large lookup cache (NamespaceExtractionCacheManager, ~6GB), a smaller kafka-consumer contribution, and many other 50-100MB allocations that are probably intermediate processing and buffer merges (groupBy-XXX, processing-XXX).
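In case it's useful to anyone, a JFR recording for this kind of TLAB view can be captured with something like the following (pid, recording name, and path are placeholders; depending on the JDK build, the JVM may also need -XX:+UnlockCommercialFeatures -XX:+FlightRecorder to allow JFR):

jcmd <historical-pid> JFR.start name=alloc settings=profile duration=10m filename=/tmp/historical.jfr

and then the resulting .jfr file opened in JMC to look at the TLAB allocation breakdown.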
But actually, I’m not sure if we should be confined to just 32GB, as these machines have much more RAM than that.
We will try 48GB or more to see the effect.
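(The 32GB figure is presumably the compressed-oops threshold: above roughly 32GB the JVM falls back to uncompressed 64-bit references, so a heap just over the line can effectively hold less than one just under it.) A quick way to check whether a given -Xmx still gets compressed oops, using the same JVM binary the historicals run, is something like:

java -Xmx48g -XX:+PrintFlagsFinal -version | grep -i UseCompressedOops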
I wonder if anyone uses such large JVM heaps on historicals?