The persistent OOM errors that I have been seeing in my staging environment have now appeared in production as well, and I urgently need help resolving them.
I have been experimenting with the JVM parameters without success; I always end up in an OOM condition. My Historical nodes are not even serving queries: they start up, load their segments, announce them, hit an OOM error, and restart.
I am trying to run them in a container with 42g of RAM, and it does not matter if I set -Xmx + -XX:MaxDirectMemorySize + druid.cache.sizeInBytes to a total well below the container's capacity; I still get an OOM. I have even tried running this same configuration (below) in a container with 84g of RAM and hit the same issue. I have also tried setting -Xss and -XX:MaxMetaspaceSize to see if that resolves the problem, and it does not.
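For what it's worth, here is the back-of-the-envelope memory budget I have been working from (the values below are illustrative, not my actual config). My understanding is that Druid sizes the Historical's direct memory need as roughly (numThreads + numMergeBuffers + 1) * processing.buffer.sizeBytes, and that the container has to hold heap + direct memory + cache with room to spare:

```python
# Rough sanity check of a Historical's memory budget.
# All values are illustrative placeholders, not my real settings.
GIB = 1024 ** 3

heap = 12 * GIB                 # -Xmx
processing_buffer = 1 * GIB     # druid.processing.buffer.sizeBytes
num_threads = 15                # druid.processing.numThreads
num_merge_buffers = 4           # druid.processing.numMergeBuffers
cache = 4 * GIB                 # druid.cache.sizeInBytes

# Direct memory demand per the Druid docs' rule of thumb:
direct = (num_threads + num_merge_buffers + 1) * processing_buffer

container = 42 * GIB
total = heap + direct + cache

print(f"heap + direct + cache = {total / GIB:.0f} GiB "
      f"vs container {container / GIB:.0f} GiB")
```

Even with numbers like these, the total comes in comfortably under the container limit, which is why the OOMs are confusing me.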
From the error file: