About the Coordinator node RAM usage problem

Relates to Apache Druid 0.22.1

I have deployed a Druid production cluster.

Here is my config:

brokers

runtime.properties


druid.service=druid/broker
        
# HTTP server threads
druid.broker.http.numConnections=50
druid.broker.http.maxQueuedBytes=10MiB
druid.server.http.numThreads=60
  
# Processing threads and buffers
druid.processing.buffer.sizeBytes=500MiB
druid.processing.numMergeBuffers=6
druid.processing.numThreads=1
druid.sql.enable=true

jvm.options

-Xmx4g
-Xms2g
-server
-XX:MaxDirectMemorySize=6g
-XX:+ExitOnOutOfMemoryError
-XX:+UseG1GC
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-Dlog4j.debug
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
-Djava.io.tmpdir=/druid/data

coordinators

runtime.properties

druid.service=druid/coordinator
        
# HTTP server threads
druid.coordinator.startDelay=PT30S
druid.coordinator.period=PT30S
  
# Configure this coordinator to also run as Overlord
druid.coordinator.asOverlord.enabled=true
druid.coordinator.asOverlord.overlordService=druid/overlord
druid.indexer.queue.startDelay=PT30S
druid.indexer.runner.type=local
  
druid.worker.capacity=20

jvm.options

-Xmx6g
-Xms6g
-server
-XX:MaxDirectMemorySize=6g
-XX:+ExitOnOutOfMemoryError
-XX:+UseG1GC
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-Dlog4j.debug
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
-Djava.io.tmpdir=/druid/data

historicals

runtime.properties

druid.service=druid/historical
druid.server.http.numThreads=50
druid.processing.buffer.sizeBytes=500MiB
druid.processing.numMergeBuffers=2
druid.processing.numThreads=7
# Segment storage
druid.segmentCache.locations=[{"path":"/druid/data/segments","maxSize":10737418240}]
druid.server.maxSize=10737418240
  
# Query cache
druid.historical.cache.useCache=true
druid.historical.cache.populateCache=true
druid.cache.type=caffeine
druid.cache.sizeInBytes=256MiB

jvm.options

-Xmx8g
-Xms4g
-server
-XX:MaxDirectMemorySize=13g
-XX:+ExitOnOutOfMemoryError
-XX:+UseG1GC
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-Dlog4j.debug
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
-Djava.io.tmpdir=/druid/data

routers

runtime.properties

druid.service=druid/router
        
# HTTP proxy
druid.router.http.numConnections=50
druid.router.http.readTimeout=PT5M
druid.router.http.numMaxThreads=100
druid.server.http.numThreads=100
  
# Service discovery
druid.router.defaultBrokerServiceName=druid/broker
druid.router.coordinatorServiceName=druid/coordinator
  
# Management proxy to coordinator / overlord: required for unified web console.
druid.router.managementProxy.enabled=true

jvm.options

-Xmx1g
-Xms1g
-server
-XX:MaxDirectMemorySize=6g
-XX:+ExitOnOutOfMemoryError
-XX:+UseG1GC
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-Dlog4j.debug
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
-Djava.io.tmpdir=/druid/data

middlemanagers

jvm.options

-Xmx6G
-Xms6G
-server
-XX:MaxDirectMemorySize=6g
-XX:+ExitOnOutOfMemoryError
-XX:+UseG1GC
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-Dlog4j.debug
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
-Djava.io.tmpdir=/druid/data

My problem

My Coordinator node uses 25 GB to 29 GB of RAM. I don't know whether this memory usage is reasonable. How should I estimate the Coordinator's RAM usage?

You can set the Coordinator heap to the same size as your Broker heap, or slightly smaller.
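For example, with the 4 GB Broker heap shown above, a starting point for the Coordinator jvm.options (treat this as a rough sketch, not an official sizing rule) could be:

-Xmx4g
-Xms4g
-server
-XX:+ExitOnOutOfMemoryError
-XX:+UseG1GC
-Duser.timezone=UTC
-Dfile.encoding=UTF-8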

Memory usage that far beyond the JVM settings is definitely odd. Can you log on to the pod?
It would be interesting to see which processes are using up the memory.
You can use something like:
ps -o pid,user,%mem,command ax | sort -b -k3 -r

According to the docs, the Coordinator uses more heap as the number of servers, segments, and tasks grows. Are any of these large in your cluster?
I also just noticed that you are running the Coordinator as an Overlord, which uses additional heap and scales memory usage with the number of tasks.
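One quick way to get those numbers, assuming SQL is enabled and the Router is reachable on its default port 8888 (adjust host and port for your deployment), is to count rows in the sys tables through the SQL endpoint:

curl -XPOST -H'Content-Type: application/json' http://localhost:8888/druid/v2/sql \
  -d'{"query":"SELECT COUNT(*) AS num_segments FROM sys.segments"}'
curl -XPOST -H'Content-Type: application/json' http://localhost:8888/druid/v2/sql \
  -d'{"query":"SELECT COUNT(*) AS num_servers FROM sys.servers"}'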

I also noticed that you have druid.worker.capacity set on the Coordinator. I'm not sure it does anything there; it belongs in the MiddleManager configuration and should be set to (vCPUs - 1) of the resources allocated to the MiddleManager pod.
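For illustration only, assuming a MiddleManager pod with 8 vCPUs (a hypothetical number), the setting would live in the MiddleManager runtime.properties rather than the Coordinator's:

druid.service=druid/middleManager
druid.worker.capacity=7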

Just a wild idea: perhaps those settings are spawning up to 20 Peon JVMs on the Coordinator pod… let us know what you find.
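One quick way to check (assuming the standard launch command, where each task runs as org.apache.druid.cli.Main internal peon) is to grep for the peon processes directly:

ps aux | grep 'internal peon' | grep -v grep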

After reading the configuration docs a bit more, I noticed you are using druid.indexer.runner.type=local. I think this needs to be set to remote in order for the MiddleManagers to run the tasks.
Take a look at the docs here:
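In other words, the Overlord-related lines in your Coordinator runtime.properties would end up looking something like this (a sketch showing only that block, with everything else unchanged and druid.worker.capacity moved to the MiddleManagers):

druid.coordinator.asOverlord.enabled=true
druid.coordinator.asOverlord.overlordService=druid/overlord
druid.indexer.queue.startDelay=PT30S
druid.indexer.runner.type=remote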

Thanks!! I have solved the problem:

  1. I executed the ps command on the Coordinator node and saw multiple Java processes; their number matched the number of my ingestion tasks. I forgot to take a screenshot here.
  2. I set druid.indexer.runner.type=remote and restarted the service.
  3. The extra Java processes on the Coordinator node disappeared and RAM usage dropped.
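A quick way to confirm that tasks are now being dispatched to the MiddleManagers (assuming the Router's management proxy on its default port 8888) is to list the registered workers and running tasks through the Overlord API:

curl http://localhost:8888/druid/indexer/v1/workers
curl http://localhost:8888/druid/indexer/v1/runningTasks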