Hi, we occasionally get OutOfMemoryErrors (for heap space) in our historical and broker nodes. Is this possibly due to misconfiguration, or is it expected when the load exceeds what the cluster can handle?
If they are expected, are they supposed to be recoverable? We tend to see problems around the times the OOM errors occur and often respond by restarting the cluster. (I don’t have a solid list of symptoms here - it’s anecdotal, and the problems could just be downstream of the OOM errors themselves and might clear up if given time.)
The most recent example occurred after we updated our QTL file and were too impatient to wait for the polling to pick up the changes, so we restarted the historicals. One of the two nodes suffered multiple OOM errors over the next hour and a half (mostly from failing to load segments, it looks like, though it’s hard to tell exactly from the logs). We then rolled back the QTL file and restarted, though I’m not convinced the new QTL was the problem, as it seemed to load up just fine.
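In case it’s relevant, the lookup spec I have in mind is a URI-based cachedNamespace one roughly like the sketch below - the URI and column names are just placeholders, and the field names are taken from the cached-global lookup extension docs, so they may not match 0.9.1.1 exactly:

{
  "type": "cachedNamespace",
  "extractionNamespace": {
    "type": "uri",
    "uri": "s3://our-bucket/lookups/qtl.csv",
    "namespaceParseSpec": {
      "format": "csv",
      "columns": ["key", "col_a", "col_b"],
      "keyColumn": "key",
      "valueColumn": "col_a"
    },
    "pollPeriod": "PT10M"
  }
}

The pollPeriod is the part we were too impatient to wait for.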
Are there particular things to look for in the Druid metrics to let us know that we are in danger of running out of memory, and/or clues to help figure out how much heap memory we ought to have? I mean, it may well just be that our cluster is underpowered, but I’d love to have some insight into how much we need to bump it up rather than relying on trial and error.
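For what it’s worth, my assumption is that the numbers to watch would be the jvm/mem/* and jvm/gc/* metrics from the JvmMonitor, enabled with something like the lines below in common.runtime.properties - the monitor class name is the one the 0.9.x docs list, so correct me if that’s the wrong place to look:

# Emit JVM heap and GC metrics (jvm/mem/used, jvm/mem/max, jvm/gc/time, jvm/gc/count)
druid.monitoring.monitors=["com.metamx.metrics.JvmMonitor"]
# Send the metrics somewhere visible; the logging emitter just writes them to the service logs
druid.emitter=logging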
For general context - we’re running 0.9.1.1, with two historicals on m4.xlarges (2G heap, though I’m going to bump it to 3G) and two brokers on m4.larges (2G heap). We do a lot of groupBy queries and use a fairly large static QTL file (~174K rows, 13 lookup columns, some of which are fairly high cardinality). We have ~90G total in storage so far, and our segment sizes are fairly low (100M-200M; I don’t think any are greater than 300M).
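If the specifics matter, here’s roughly what I’m planning for the historical jvm.config after the bump - the numbers are illustrative, and the heap-dump flags are standard HotSpot options I’m adding so we can actually inspect what’s filling the heap next time:

-server
-Xms3g
-Xmx3g
-XX:MaxDirectMemorySize=4g
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/druid/historical.hprof
-Duser.timezone=UTC
-Dfile.encoding=UTF-8

The MaxDirectMemorySize value is just a guess to leave headroom for the processing buffers; I realize that’s off-heap and separate from the heap OOMs we’re seeing.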