I have read that documentation; however, while it gives guidelines for a specific hardware configuration, it isn't clear how to map those guidelines to other hardware configurations. Druid's memory requirements are very complex to understand and depend on many factors (and are different for each process): buffer settings, the number of threads, the number of dimensions in your datasource, how many Kafka partitions you are ingesting, and so on.
I have deployed Druid as Imply recommends, with the Historical and MiddleManager processes sharing a host. However, I am constantly having problems with the Historical node running out of memory, and I do not know why. My understanding (probably incomplete/wrong) is that the Historical node requires the following number of bytes available to it to run:
```
  (Xmx + MaxDirectMemorySize)                                        <- JVM opts
+ (druid.processing.numThreads + 1) * druid.processing.buffer.sizeBytes
+ druid.cache.sizeInBytes                                            <- runtime properties
```
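To make the formula concrete, here is how I evaluate it with some purely illustrative numbers (these are not my actual settings):

```
# Hypothetical values, for illustration only:
#   Xmx                               = 8 GB
#   MaxDirectMemorySize               = 13 GB
#   druid.processing.numThreads       = 15
#   druid.processing.buffer.sizeBytes = 800,000,000   (~0.8 GB)
#   druid.cache.sizeInBytes           = 2,000,000,000 (~2 GB)

heap:    8 GB
direct:  (15 + 1) * 0.8 GB = 12.8 GB   (must fit under MaxDirectMemorySize)
cache:   2 GB
total:   ~23 GB on a 64 GB host
```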
My hardware configuration is 64 GB of RAM, 16 cores, and 500+ GB of disk. I have configured my Historical node as follows (a sketch of the corresponding properties appears after the list):
HTTP server threads
Processing threads and buffers
Make all query types cacheable. This is not the default, which excludes groupBy and select queries, so it might be a terrible idea.
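Concretely, that section of my historical runtime.properties is shaped roughly like the sketch below; the specific values are placeholders rather than my exact settings:

```
# HTTP server threads
druid.server.http.numThreads=40

# Processing threads and buffers
druid.processing.numThreads=15
druid.processing.buffer.sizeBytes=800000000

# Make all query types cacheable: the default unCacheable list is
# ["groupBy", "select"], and an empty list caches everything
druid.historical.cache.useCache=true
druid.historical.cache.populateCache=true
druid.historical.cache.unCacheable=[]
druid.cache.sizeInBytes=2000000000
```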
I changed the supervise config file to run ONLY the Historical node, and it still crashes with an OOM. It does not receive any queries at all; all it does is load the segments from its segment cache and then start announcing their availability. During the announcing phase it crashes with an OOM, and supervise restarts it. The only way to get the node back is to delete the segment cache, after which it will run for roughly 24 hours and then fail again.

It could be that I am badly misunderstanding how to configure it, but it seems to me that it should be able to start up no matter how badly I have configured its memory allocation. There is plenty of RAM available on the host, so I concluded that I had not allocated sufficient heap space for the number of segments it has been asked to manage. But that doesn't seem right: my Historical node only has roughly 48 GB of segments assigned to it.
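To get more visibility into the next crash, I am adding standard HotSpot diagnostic flags to the Historical's jvm.config. The heap and direct-memory sizes below are placeholders rather than my real settings, and jvm.config takes one bare flag per line, so the trailing comments are only for exposition:

```
-server
-Xms8g
-Xmx8g
-XX:MaxDirectMemorySize=13g
-XX:+HeapDumpOnOutOfMemoryError        # write an .hprof file when an OOM is thrown
-XX:HeapDumpPath=/tmp/historical-oom.hprof
-verbose:gc                            # log GC activity: gradual fill vs. sudden spike
-XX:+PrintGCDetails
```

If the heap dump turns out to be dominated by segment metadata during announcement, that would point at Xmx being too small for the number of segments rather than at the processing buffers.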
I tried changing the configuration, and I still get OOM errors.
The Java crash dump shows:
```
Memory: 4k page, physical 131840252k(14636344k free), swap 4194300k(2312500k free)
```
So there were still about 14 GB of physical RAM free when the JVM died, which suggests the process is running into its own configured limits rather than exhausting the machine.