GC Selection and Tuning for Druid roles (CMS vs G1GC)

I have a general question about GC selection and configuration for production Druid deployments.

I’ll start off by linking this production configuration example from druid.io:


Here, the Historical and Broker are using CMS. Could a case be made for switching the Broker and Historical to G1GC? In our experience, we have seen some long GC pauses with CMS, especially on the Historical node. When we switched to G1GC in our staging environment, those extended pauses on the Historical node went away.

Essentially, I am looking for an explanation/rationale for the production cluster example linked above. Is there something I am not considering in wanting to migrate to G1GC instead of CMS? Switching to G1GC has often been our approach for other large-heap Java processes at my org.
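For concreteness, here is a sketch of the kind of change we made in staging. The heap sizes and flags below are hypothetical examples, not our actual production values, and not taken from the druid.io page:

```
# Hypothetical jvm.config for a Historical — CMS baseline
-server
-Xms8g
-Xmx8g
-XX:+UseConcMarkSweepGC
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps

# G1GC variant we tried in staging (sketch, not a recommendation)
-server
-Xms8g
-Xmx8g
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
```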

I would love to hear what other people have experienced/attempted in terms of GC selection and tuning in production.



There is reason to expect G1 to give shorter pause times than CMS in some situations; see https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/g1_gc.html for details. Something that page doesn’t mention, but that I think matters in practice, is that G1 is easier to tune than CMS. So even if CMS would in theory be better for a given workload, you might not have the patience to tune it properly.
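To illustrate the tuning-effort point: with G1 you mostly set a pause-time goal and let the collector size its regions and young generation itself, whereas getting CMS to behave often means juggling several interacting flags. The specific values below are illustrative, not recommendations:

```
# G1: typically one primary knob, a pause-time goal
-XX:+UseG1GC
-XX:MaxGCPauseMillis=100

# CMS: tuning usually involves several interacting flags, e.g.
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+CMSParallelRemarkEnabled
-XX:NewSize=2g   # young-gen sizing also interacts with pause behavior
```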

The production cluster page on the Druid site is, sadly, out of date, and in fact isn’t even in the Druid repo anymore. I think we just forgot to delete the HTML page from the web site, so I wouldn’t put too much stock in it. I’ll look into what we can do about that…