We’re running our historical nodes in GKE, with each historical server as the only thing running on its node, storing segments on local SSD (with GCS for deep storage).
Right now we’re putting a single 375GB SSD on each node, and that’s currently what determines how many nodes we run. However, GKE supports up to 8 SSDs per node, so we’re considering running with more storage per node to reduce node count (machines are much more expensive than SSDs!). Each node has 60GB of memory, most of which we allocate to Druid, so the amount of our data held in memory is already far lower than the amount we store on disk.
I can look at the nodes’ CPU usage and see how much headroom we have there, but because Druid historicals use memory in several ways, including the Linux page cache, I’m not sure how to do the same analysis for memory. (I have read http://druid.io/docs/latest/operations/performance-faq.html.) Given our usage patterns, what should I be looking at to decide whether it’s reasonable to double, quadruple, or octuple the amount of data stored on each node? (Other than “end-to-end performance of our queries”, of course.)
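For reference, the kind of snapshot I can take today looks like the following — a rough sketch that parses /proc/meminfo on a node to split memory into page cache vs. everything else (the field names are standard Linux; nothing here is Druid-specific, and I’m not sure this is even the right breakdown to look at):

```python
# Rough sketch: summarize a node's memory split from /proc/meminfo.
# "Cached" + "Buffers" approximates the page cache; the remainder is
# mostly anonymous memory (e.g. the Druid JVM heap and direct buffers).
def meminfo_kb():
    """Parse /proc/meminfo into a dict of field -> size in kB."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key] = int(rest.strip().split()[0])  # values are in kB
    return info

if __name__ == "__main__":
    m = meminfo_kb()
    total = m["MemTotal"]
    cache = m["Cached"] + m.get("Buffers", 0)
    free = m["MemFree"]
    other = total - cache - free
    # kB -> GiB: divide by 2**20
    print(f"total: {total / 2**20:.1f} GiB")
    print(f"page cache: {cache / 2**20:.1f} GiB ({100 * cache / total:.0f}%)")
    print(f"free: {free / 2**20:.1f} GiB")
    print(f"other (heap, anon, etc.): {other / 2**20:.1f} GiB")
```

But a point-in-time split like this doesn’t tell me how much of that page cache is actually *useful* for our query patterns, which is the part I don’t know how to measure.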