Historical node memory calculation

Hello.

I need help adjusting the memory parameters for my historical nodes.

I am running historical nodes on AWS r3.large instances (15.25 GiB of RAM).

Current historical runtime.properties:

druid.processing.buffer.sizeBytes=1073741824

druid.processing.numThreads=1

druid.extensions.loadList=["druid-s3-extensions"]

druid.segmentCache.locations=[{"path": "/mnt/druid/indexCache", "maxSize": 26843545600}]

druid.server.maxSize=26843545600

jvm:

-Xmx4g -Xms4g -XX:MaxDirectMemorySize=6g -XX:NewSize=2g -XX:MaxNewSize=2g -XX:+UseConcMarkSweepGC

With the settings above, the historical node crashes from time to time with an OOM error:

There is insufficient memory for the Java Runtime Environment to continue.

Native memory allocation (mmap) failed to map 65536 bytes for committing reserved memory.

The configuration example in the documentation is for a much bigger host, but I don't think we need one that big in our case; usage is quite small and the indexCache is only 6 GB at the moment.
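For reference, the settings above commit roughly 4 GiB of heap + 6 GiB of direct memory ≈ 10 GiB; assuming ~1 GiB of JVM overhead on top of that, only about 4 GiB of the 15.25 GiB is left for the OS and the page cache used by memory-mapped segments.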

Is there a formula for calculating the correct values based on available memory? Or which values should be made smaller?

Thanks,

Tarvi

I have been seeing similar problems and would also love to see documentation around RAM utilization for all of the Druid processes, although I have had the most problems with historicals and peons (specifically with the Kafka indexing service tasks).

–Ben

Examples of how to configure for different hardware are described here: https://imply.io/docs/latest/cluster

Hi Fangjin,

I have read that documentation; however, while it gives guidelines for a specific hardware configuration, it isn't clear how to map those to other hardware configurations. Druid's memory requirements are very complex to understand and depend on many factors (which differ for each process): buffer settings, the number of threads, the number of dimensions in your data source, how many Kafka partitions you are ingesting, and so on.

I have deployed Druid as Imply recommends with the Historical and MiddleManagers sharing a host. However, I am constantly having problems with the Historical node running out of memory and I do not know why. My understanding (probably incomplete/wrong) is that the Historical node requires:

Java opts (Xmx + MaxDirectMemorySize) + runtime properties ((druid.processing.numThreads + 1) * druid.processing.buffer.sizeBytes + druid.cache.sizeInBytes)

bytes available to it in order to run. My hardware configuration is 64 GB of RAM, 16 cores, and 500+ GB of disk. I have configured my Historical node as follows:

jvm.config

-server

-Xms18g

-Xmx18g

-XX:NewSize=8g

-XX:MaxNewSize=8g

-XX:MaxDirectMemorySize=8g

-XX:+UseConcMarkSweepGC

-XX:+PrintGCDetails

-XX:+PrintGCTimeStamps

-Duser.timezone=UTC

-Dfile.encoding=UTF-8

-Djava.io.tmpdir=/services/druid/data/tmp

-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager

-Dservicename=historical

-XX:+HeapDumpOnOutOfMemoryError

-XX:+CrashOnOutOfMemoryError

-XX:ErrorFile=/var/log/services/druid/historical_pid_%p.log

runtime.properties:

druid.service=druid/historical

druid.port=8083

# HTTP server threads

druid.server.http.numThreads=50

# Processing threads and buffers

druid.processing.buffer.sizeBytes=536870912

druid.processing.numThreads=2

# Segment storage

druid.segmentCache.locations=[{"path":"/services/druid/data/segment-cache","maxSize":130000000000}]

druid.server.maxSize=130000000000

# Make all query types cacheable. This is not the default, which excludes group-by and select queries.

# So it might be a terrible idea.

druid.historical.cache.unCacheable=

druid.historical.cache.useCache=true

druid.historical.cache.populateCache=true
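For reference, by my formula above this configuration needs roughly 18 GiB of heap + 8 GiB of direct memory + (2 + 1) * 512 MiB of processing buffers ≈ 27.5 GiB, plus whatever the query cache is allowed to use, which should fit comfortably in 64 GB of RAM.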

I changed the supervise config file to run ONLY the Historical node, and it still crashes with OOM. It does not receive any queries at all. All it does is load all of the segments from cache and then start to announce their availability. During the announcing phase it crashes with OOM and supervise restarts it. The only way to get the node back is to delete the segment cache, at which point it will work for roughly 24 hours and then fail again. It could be that I am badly misunderstanding how to configure it, but it seems to me that it should be able to start up no matter how badly I have configured its memory allocation. There is plenty of RAM available on the host, so I concluded that I hadn't allocated sufficient heap space for the number of segments it has been asked to manage. But that doesn't seem right: my historical node only has roughly 48 GB of segments assigned to it.

I tried changing the configuration to:

-server

-Xms32g

-Xmx32g

-XX:NewSize=16g

-XX:MaxNewSize=16g

-XX:MaxDirectMemorySize=12g

-XX:+UseConcMarkSweepGC

-XX:+PrintGCDetails

-XX:+PrintGCTimeStamps

-Duser.timezone=UTC

-Dfile.encoding=UTF-8

-Djava.io.tmpdir=/services/druid/data/tmp

-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager

-Dservicename=historical

-XX:+HeapDumpOnOutOfMemoryError

-XX:+CrashOnOutOfMemoryError

-XX:ErrorFile=/var/log/services/druid/historical_pid_%p.log

And I still get OOM errors.

The Java dump shows:

Memory: 4k page, physical 131840252k(14636344k free), swap 4194300k(2312500k free)

So there is still roughly 14 GB of physical RAM free.

–Ben

Historical nodes use off-heap memory to store intermediate results, and by default all segments are memory mapped before they can be queried. Typically, the more memory available on a historical node, the more segments can be served without the possibility of data being paged out to disk. On historicals, the JVM heap is used for GroupBy queries, some data structures used for intermediate computation, and general processing. One way to calculate how much space there is for segments is: memory_for_segments = total_memory - heap - direct_memory - jvm_overhead (~1G). Note that total_memory here refers to the memory available to the cgroup (if running on Linux), which in the default case is all of the system memory.

We recommend 250 MB * (processing.numThreads) for the heap.
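As a rough worked example with the configuration above (64 GB of RAM, an 18 GB heap, and 8 GB of direct memory): memory_for_segments ≈ 64 - 18 - 8 - 1 ≈ 37 GB left for the page cache to hold memory-mapped segments.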

Let me know if this helps at all.

– FJ

Hi Fangjin,

Thanks very much for your reply. I deleted the segment cache on 3 of my historical nodes and that got them back up and running; however, I am quite sure that at some point they are going to hit their limit and die again. I left 1 node in a bad state so that I can test potential solutions to the problem. On that host the segment-cache directory has 49 GB of data (64687 segments; maybe the number of files is the problem?), and starting it (with the configuration I sent) results in an infinite loop of OOM errors. I have since run a bunch of compaction jobs that merge the many segments created by the Kafka indexing service into larger files, but this instance is still caching the old segment files.

Is it the case that I need enough RAM to contain everything allocated to the Historical? I thought it would page to disk when RAM was exhausted, so I had allocated 130GB of segments to each historical (having only 64GB of RAM). If it does page to disk, then what is the relationship between the segment cache and memory used? Something about the state of my cache plus the configuration of the JVM is causing it to die.

I tried allocating just 6 GB of heap (I tried less, but it didn't successfully load all of the segments before going OOM) with only 3 processing threads and 50 GB of direct memory. However, the behavior is the same: it loads and announces all the segments, then goes OOM and starts over.

–Ben

We have solved our historical node OOM issues. We found the solution in this thread: https://groups.google.com/forum/#!topic/druid-user/P2j8jya4k0k

The kernel parameter vm.max_map_count was too small in our case; by default it is 65530. You can see the current usage with the following command:

wc -l /proc/<pid>/maps

In our case that number was really close to 65530.

Before increasing that kernel parameter, we also had to remove the indexCache after an OOM crash (removing the indexCache was part of our upstart script).

I wonder why this kernel parameter is not mentioned in the documentation; the Elasticsearch docs, for example, clearly say that it must be increased.
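For anyone else hitting this, here is a quick sketch of how to check and raise the limit (262144 is the value the Elasticsearch docs recommend; pick whatever suits your segment count):

sysctl vm.max_map_count

sudo sysctl -w vm.max_map_count=262144

To make the change survive a reboot, also add vm.max_map_count=262144 to /etc/sysctl.conf.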

Regards,

Tarvi

Brilliant! Thank you, this was exactly what I needed to know. I have lost a tremendous amount of time to this issue and totally agree that it would be very helpful to have this in the Druid documentation. Now I can set up alerting to warn if we approach the threshold again.
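As a minimal sketch of such a check (assuming the historical's PID is passed in as the first argument, and using an arbitrary 80% warning threshold), something like this could run from cron:

#!/bin/sh
# Warn when a process's memory-map count approaches vm.max_map_count.
pid="$1"
maps=$(wc -l < "/proc/$pid/maps")
limit=$(sysctl -n vm.max_map_count)
if [ "$maps" -gt $(( limit * 80 / 100 )) ]; then
  echo "WARN: pid $pid is using $maps of $limit memory maps"
fi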

Thanks again,

–Ben

Hey guys,

Can you submit a PR that adds this to the documentation?