Managing JVM memory on brokers

Hi all,

After we send a quite demanding query, the Broker consumes all available JVM memory and then starts responding very slowly. The problem is that once it reaches its memory limit it stays in that state, and we have to restart it.

We maxed out our settings:

JVM:

-server
-Xmx192g
-Xms192g
-XX:NewSize=6g
-XX:MaxNewSize=6g
-XX:MaxDirectMemorySize=64g

And some Broker runtime properties:

Processing threads and buffers:

druid.processing.buffer.sizeBytes=2147483647
druid.processing.numThreads=11
druid.processing.numMergeBuffers=2

[Screenshot showing metrics]

Any ideas on how we should deal with this situation? Can we force the Broker to clean up its JVM memory instantly?

Whoa, that’s a really large broker heap. What kinds of queries are you doing? What is druid.server.http.numThreads set to?

The JVM reclaims unused heap memory on its own, so if you’re running out, it must all be used for something. Most likely, it’s related to a query or queries you are doing.
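If you want to double-check whether the heap is really full of live objects or just hasn't been collected yet, one option (a diagnostic, not a fix; <broker-pid> is a placeholder for the Broker's process id) is to trigger a full GC by hand:

jcmd <broker-pid> GC.run

If heap usage stays high after that, the memory is genuinely held by live objects rather than by uncollected garbage.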

Besides “normal” Druid use, like simple real-time and historical queries, we're experimenting with new areas of application. One of them is “Druid as a data preparation tool for data scientists”. Instead of using Hadoop+Spark, we're trying to use Druid for very convenient data wrangling, mainly for creating new synthetic features (like behavioral metrics).
So we define many custom metrics, split the data by user id, and then use our R SDK to extract a CSV for machine learning experiments.

The example query has 24 metrics, 13 groupBy dimensions, and some filters. The datasource has 300M rows with 5M users, so the end result would have 5M rows and 37 columns (hundreds of MB or even more). I am aware that this is not the intended use of Druid, but compared to manually writing and running MapReduce jobs on Hadoop, I see real potential here in the speed of experimentation and the flexibility.
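For reference, the query is shaped roughly like the sketch below (datasource, dimension, and metric names are made up for illustration; the real query has 13 dimensions and 24 aggregators, many of them custom):

{
  "queryType": "groupBy",
  "dataSource": "user_events",
  "granularity": "all",
  "intervals": ["2017-01-01/2017-02-01"],
  "dimensions": ["user_id", "country", "device_type"],
  "filter": { "type": "selector", "dimension": "country", "value": "US" },
  "aggregations": [
    { "type": "longSum", "name": "clicks", "fieldName": "clicks" },
    { "type": "doubleSum", "name": "session_time", "fieldName": "session_time" },
    { "type": "hyperUnique", "name": "distinct_sessions", "fieldName": "session_id_hll" }
  ]
}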

My question was why, after running (and finishing) that kind of query, the JVM memory is not reclaimed, and what we can do about it.

Thanks!

You were also asking about druid.server.http.numThreads. Here it is:

HTTP server threads:

druid.server.http.numThreads=11
druid.server.http.maxIdleTime=PT15m
druid.broker.http.numConnections=50
druid.broker.http.readTimeout=PT15M

Servers have 6 physical cores / 12 threads and 256 GB RAM.

The same goes for the historicals, and all data on the historicals fits in memory (druid.server.maxSize < available RAM).

One way to deal with this is to lower the max heap size; that way GC will run more often.

Do you really need that big a heap?
BTW, are you using any of the JavaScript aggregators or filters?

I'm not sure if we need that heap. Please look at the chart below:

green line: jvm/mem/max

blue line: jvm/mem/used

For normal use the Broker consumes about 30 GB. But when we start running those heavy queries it consumes everything it has. As you can see, when we had 128 GB (~140G bytes on the chart) the Broker used almost all of it. We then changed it to 192 GB, and even that amount was consumed entirely. So can I assume that we in fact need that big a heap?

And when we reached those peaks we had to restart the Broker because it became so slow to respond. Heap usage dropped to 10 GB after the restart.

We use JavaScript aggregators.

In general, GC will not kick in unless certain conditions are met, such as remaining free space or time since the last GC, depending on the collection strategy.

The more heap you have, the less often you see GC in general. I would recommend sticking with 30G and seeing whether GC kicks in more often and whether the pauses are acceptable.
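For example, in the same style as your current jvm.config (just a sketch; I kept your NewSize and direct memory settings, adjust as needed):

-server
-Xmx30g
-Xms30g
-XX:NewSize=6g
-XX:MaxNewSize=6g
-XX:MaxDirectMemorySize=64g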

The JavaScript aggregators might be your heap-killer issue! If performance and memory use are important, native Druid aggregations (or a native extension) are the way to go. Are you sure you can't replace those aggregators with native ones?
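For example, a simple sum written as a JavaScript aggregator (column and metric names below are made up for illustration):

{
  "type": "javascript",
  "name": "session_time",
  "fieldNames": ["session_time"],
  "fnAggregate": "function(current, v) { return current + v; }",
  "fnCombine": "function(a, b) { return a + b; }",
  "fnReset": "function() { return 0; }"
}

can usually be replaced by the equivalent native aggregator, which avoids the compiled-class and per-row scripting overhead:

{
  "type": "doubleSum",
  "name": "session_time",
  "fieldName": "session_time"
}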

Keep in mind that Druid compiles JavaScript functions once per node per query, and garbage collection of the compiled classes can be an issue.

You might need to tune the GC strategy to ensure that the compiled classes are evicted.
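For example (a sketch, assuming a JDK 8-era collector; check the flags against the JVM version you actually run), class unloading is controlled by flags like:

-XX:+UseG1GC
-XX:+ClassUnloadingWithConcurrentMark

or, if you stay on CMS:

-XX:+UseConcMarkSweepGC
-XX:+CMSClassUnloadingEnabled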

Please refer to this http://druid.io/docs/latest/development/javascript.html