Druid GC errors on shared hosts


Hi, we have deployed Druid to a container cluster in AWS, with each historical occupying its own (fairly beefy) node. We used the configuration below to partition each node's resources. I’ll also add that other containers share the same nodes, but they run ancillary Druid roles (middleManagers, etc.).

Right now the application is erroring out, presumably due to GC allocation failures on the historicals, and these messages appear in all historical logs at the same time, which seems odd to me. Has anyone had experience working around or fixing GC allocation errors in Druid? One more detail: most of the RAM on these boxes (70% or more) is buffer/cache memory, so the VMs are heavily underutilized.

Here is a short summary of my questions:

  • Even though host memory is mostly buffers/cache, can the heap sizes of the individual containers / Druid roles conflict with each other under such a soft memory limit when it comes to setting their cache sizes?

  • What monitoring solutions are people using to track Druid node uptime?


Errors coming out of historical nodes across each host:

8/9/2017 11:01:44 AM 1173889.702: [GC (Allocation Failure) 1173889.702: [ParNew: 5087839K->30310K(5662336K), 0.0176648 secs] 8679554K->3630385K(11953792K), 0.0178462 secs] [Times: user=0.23 sys=0.01, real=0.02 secs]
8/9/2017 11:24:20 AM 1175244.975: [GC (Allocation Failure) 1175244.975: [ParNew: 5063526K->35158K(5662336K), 0.0147335 secs] 8663601K->3643372K(11953792K), 0.0148392 secs] [Times: user=0.20 sys=0.00, real=0.02 secs]
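
In case it helps anyone reading these lines, here is a rough Python sketch that pulls the numbers out of them: the reported cause, the pause time, how much the young generation freed, and how full the heap was afterwards. The regex is only an assumption based on the two ParNew lines above, and `summarize_gc_line` is just a helper name I made up for this post; other collectors or logging flags would need a different pattern.

```python
import re

# Assumed pattern for a ParNew line shaped like the two log lines above.
# All sizes are in KB: the first triple is the young generation,
# the second triple is the whole heap.
GC_LINE = re.compile(
    r"\[GC \((?P<cause>[^)]+)\)\s+[\d.]+: "
    r"\[ParNew: (?P<young_before>\d+)K->(?P<young_after>\d+)K"
    r"\((?P<young_cap>\d+)K\), [\d.]+ secs\] "
    r"(?P<heap_before>\d+)K->(?P<heap_after>\d+)K"
    r"\((?P<heap_cap>\d+)K\), (?P<pause>[\d.]+) secs\]"
)


def summarize_gc_line(line):
    """Return a dict of basic figures for one GC line, or None if it does not match."""
    m = GC_LINE.search(line)
    if m is None:
        return None
    young_before = int(m.group("young_before"))
    young_after = int(m.group("young_after"))
    heap_after = int(m.group("heap_after"))
    heap_cap = int(m.group("heap_cap"))
    return {
        "cause": m.group("cause"),
        # Total pause reported for the collection, converted to milliseconds.
        "pause_ms": float(m.group("pause")) * 1000.0,
        # KB reclaimed from the young generation by this collection.
        "young_freed_kb": young_before - young_after,
        # Heap occupancy after the collection, as a percentage of heap capacity.
        "heap_after_pct": 100.0 * heap_after / heap_cap,
    }


if __name__ == "__main__":
    sample = (
        "8/9/2017 11:01:44 AM1173889.702: [GC (Allocation Failure) 1173889.702: "
        "[ParNew: 5087839K->30310K(5662336K), 0.0176648 secs] "
        "8679554K->3630385K(11953792K), 0.0178462 secs] "
        "[Times: user=0.23 sys=0.01, real=0.02 secs]"
    )
    print(summarize_gc_line(sample))
```

Running it over a whole historical log (one call per line, skipping `None` results) gives a quick view of pause times and post-GC heap occupancy over time, which should make it easier to compare what each host was doing at the moment the errors showed up.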


Here is the JVM configuration: