Historical node freezing on i3.8xlarge (NVMe -- SSD)


We have a problem with druid historical nodes freezing specifically when run on i3.8xlarge ec2 instances. These instances have large fast NVMe SSD drives. We monitor the historical node responsiveness through the “/status” endpoint and when it is unresponsive, we first try to get a stack trace from it using “kill -3” on the PID before we kill and restart the process. Strangely, we never get a stack trace in the log when the “/status” endpoint is unresponsive and there are never any exceptions in the logs. Today, however I found the following exception in the historical log.

A fatal error has been detected by the Java Runtime Environment: