Hi,
I would like to know about heap configuration for the Historical processes on our data servers.
Data server details: 2 data servers of type i3.2xlarge.
As per the Druid site, the maximum recommended heap is 24 GB. I have about 50 GB of data in Druid after all the tables are loaded.
Also, please note that there is no streaming ingestion, only batch ingestion.
(1) If I set the heap to 24 GB, does that mean 24 GB of segments are loaded into the heap from deep storage (S3)?
(2) What happens to the remaining 26 GB (50 − 24)? Does it remain in S3, or is it loaded onto the SSD of the i3 instance?
(3) Is it possible to load all 50 GB of segments into memory so my queries are fast (sub-second)? What configuration is needed for this?
(4) Should I also configure direct memory in order to load segments into memory?
An i3.2xlarge has 61 GB of RAM and 8 vCPUs. The heap is not used to load segments; it is used for computation. Direct memory is used for the aggregation buffers. What is left over, total RAM − heap − direct memory, is available as page cache for memory-mapping segments.

In your case I would set the heap to 4 GB (0.5 GB per vCPU) and direct memory to 13 GB (buffer size of 1 GB and 4 merge buffers). With this you will have 61 − 17 = 44 GB available for memory-mapping segments. If you have 50 GB of data in total, this should be enough.
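As a concrete sketch, that sizing could look roughly like the following in the Historical's jvm.config and runtime.properties (the numThreads value of 7, i.e. vCPUs − 1, is an assumption following Druid's default; Druid requires direct memory of at least (numThreads + numMergeBuffers + 1) × buffer size, which here is 12 GB, within the 13 GB limit):

```properties
# jvm.config (Historical) -- sketch, not a complete flag list
-server
-Xms4g
-Xmx4g
-XX:MaxDirectMemorySize=13g

# runtime.properties (Historical)
druid.processing.buffer.sizeBytes=1073741824   # 1 GB per processing buffer
druid.processing.numMergeBuffers=4
druid.processing.numThreads=7                  # assumed default: vCPUs - 1
```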
Also, if I set the replication factor to 1, will that allow more data to fit in memory? Would that be a good strategy, considering that it is only batch loads at the moment, so losing one instance would have less of an impact?
One last question related to the page cache you mentioned above.
Say there is not enough RAM to mmap all the data. The OS kernel will load some of the data into the page cache, and when requested data is not in the cache it results in a page fault, and the required data is then loaded from disk into the page cache. Is this understanding correct?
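The mechanics described above can be sketched in Python (illustrative only; Druid performs the mmap internally on its segment files, this just demonstrates the lazy, page-fault-driven loading):

```python
# Memory-mapped reads and the OS page cache: mapping a file reserves
# address space only; pages are read from disk on first access.
import mmap
import os
import tempfile

# A throwaway file standing in for a segment on local disk.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"x" * 4096 * 4)  # four pages of data

with open(path, "rb") as f:
    # No file data is read at mmap time.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # First touch of a page not in the page cache triggers a page
    # fault; the kernel reads that page from disk into the cache.
    first_byte = mm[0]
    # Later reads of the same page are served from the page cache,
    # with no disk I/O.
    mm.close()

os.remove(path)
```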