How to tell how loaded a historical is / how much data to put on a node

We’re running our historical nodes in GKE, with each historical server as the only thing running on its node, storing segments on local SSD (with GCS for deep storage).

Right now we’re putting a single 375GB SSD on each node, and that’s currently what determines how many nodes we run. However, GKE supports up to 8 local SSDs per node, so we’re considering putting more storage on each node to reduce node costs (machines are much more expensive than SSDs!). Each node has 60GB of memory, most of which we allocate to Druid, so the amount of data we can keep in memory is already far lower than the amount we store on disk.
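For reference, here’s the back-of-the-envelope math I’m doing (the page-cache figure is an illustrative guess at what’s left after the historical heap, not something I’ve measured):

# Rough disk:memory ratios for the options I'm weighing.
# page_cache_gb is a guess at how much of the node's 60 GB is left
# for the OS page cache after the historical heap and other processes.
ssd_gb = 375
page_cache_gb = 40

for num_ssds in (1, 2, 4, 8):
    disk_gb = num_ssds * ssd_gb
    print(f"{num_ssds} SSD(s): {disk_gb} GB of segments on disk, "
          f"~{disk_gb / page_cache_gb:.0f}x the memory available to cache them")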

I can look at how much CPU the nodes use and see how much headroom we have there, but because Druid historicals use memory in several ways, including the Linux page cache, I’m not sure how to do the same analysis for memory. (I have read http://druid.io/docs/latest/operations/performance-faq.html.) What should I be looking at to decide whether, given our usage patterns, it would be reasonable to double, quadruple, or octuple the amount of data stored on each node? (Other than “end to end performance of our queries,” of course.)

–dave

Hi David,

The right memory:disk ratio depends on the types of queries you run and the amount of data they touch.

Historical segment scan times and the amount of data being read from disk will give you a better idea of how loaded a historical actually is.

Druid emits many metrics (http://druid.io/docs/latest/operations/metrics.html). The most useful ones for spotting paging are query/segment/time and sys/disk/read (the sys/* metrics require the SysMonitor to be enabled).
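To see these you need a metrics emitter configured; query/segment/time is then emitted for every query, while sys/disk/read comes from the SysMonitor. A minimal common.runtime.properties sketch (the monitor class name below is from the 0.x docs, so check the package name for your Druid version):

druid.monitoring.emissionPeriod=PT1M
druid.monitoring.monitors=["com.metamx.metrics.SysMonitor"]
druid.emitter=logging
druid.emitter.logging.logLevel=info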

From these you should be able to get a good understanding of how changing the memory settings affects performance for your use case.
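If you log metrics with the logging emitter, a rough sketch like this can summarize them (it assumes one JSON metric event per log line; adjust the extraction for however you actually ship metrics):

# Summarize query/segment/time and sys/disk/read from a historical's log.
import json
import re
import statistics
import sys

scan_times_ms = []   # query/segment/time values (ms per segment scan)
disk_reads = []      # sys/disk/read values per emission period

event_pattern = re.compile(r'\{.*"feed"\s*:\s*"metrics".*\}')

with open(sys.argv[1]) as log:
    for line in log:
        match = event_pattern.search(line)
        if not match:
            continue
        try:
            event = json.loads(match.group(0))
        except ValueError:
            continue
        if event.get("metric") == "query/segment/time":
            scan_times_ms.append(event["value"])
        elif event.get("metric") == "sys/disk/read":
            disk_reads.append(event["value"])

if scan_times_ms:
    scan_times_ms.sort()
    p95 = scan_times_ms[int(0.95 * (len(scan_times_ms) - 1))]
    print(f"query/segment/time: n={len(scan_times_ms)} "
          f"median={statistics.median(scan_times_ms):.0f}ms p95={p95:.0f}ms")
if disk_reads:
    print(f"sys/disk/read: n={len(disk_reads)} "
          f"mean={statistics.mean(disk_reads):.0f} per emission period")

Rising segment scan times together with growing disk reads as you add more data per node would be the sign that the page cache can no longer hold your working set.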