Memory and CPU impact for Coordinator and Historical nodes

Quite frequently we see that our Coordinator nodes are unstable: we see the errors below in the logs, and we cannot keep the Coordinator console up all the time.

Coordinator node: 8 CPU cores, 48 GB RAM

  • GC overhead limit exceeded

For the GC overhead error, I updated the heap -Xmx setting, but that only seems to help for part of the day.

Heap: from -Xmx4g to -Xmx6g

Direct memory: from 6 GB to 8 GB
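
For reference, after this change our Coordinator jvm.config looks roughly like the following (paths and exact values approximate):

-server
-Xms6g
-Xmx6g
-XX:MaxDirectMemorySize=8g
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-Djava.io.tmpdir=/data/druid/tmp
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager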

To avoid GC overhead issues, how should we estimate the -Xmx setting for a production Coordinator node?

  • ZooKeeper KeeperException: NoNode found …
    We moved the ZooKeeper ensemble to a separate cluster instead of sharing it with the regular Kafka cluster.

Still, we see issues with the Coordinator and the ZooKeeper KeeperException.

Historical Node

8 CPU cores and 48 GB RAM

  • All Historical node processes are running, but they do not always show up in the Coordinator console.
  • Each node is not consistently reported as available in the Coordinator console.
  • I think this will affect segment handoff/storage for my real-time tasks.

Currently we generate ~60 segments, which are served by the Historical nodes, and deep storage is already at 2 TB.

We use the Kafka indexing service, which uses the supervisor feature.

Chitra

Chitra,

You may want to look first at the issue with ZooKeeper, as it largely controls how the Druid master services operate. Can you paste your common.runtime.properties here? Try pinging ZooKeeper from your master servers; if you can't ping it, there could be something going on at the network level, or ZooKeeper is not properly set up as a quorum. Please also paste your zoo.cfg and your dataDir myid file (/data/zookeeper/myid).
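
For example, from each master server a quick check could look something like this (zk-host1 is a placeholder for one of your ZooKeeper hosts; on newer ZooKeeper versions the four-letter-word commands may need to be whitelisted via 4lw.commands.whitelist):

ping -c 3 zk-host1
echo ruok | nc zk-host1 2181   # expect "imok" if ZooKeeper is answering on that port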

Your memory settings should be sufficient for the masters, as they don't consume that many resources.

Rommel Garcia

Hi Rommel,

I appreciate your response here.

I did set up a new ZooKeeper cluster recently and found the problems persisted.

The master servers can ping all of the ZooKeeper cluster nodes.

ZooKeeper config (zoo.cfg):

tickTime=2000
dataDir=/team/tools/zookeeper/var
autopurge.snapRetainCount=100
autopurge.purgeInterval=1
clientPort=2181
initLimit=5
syncLimit=2
server.1=:2888:3888
server.2=:2888:3888
server.3=:2888:3888

And I verified the myid value on each cluster node:

hostname1: 1

hostname2: 2

hostname3: 3
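
For completeness, this is the kind of check I can run on each ZooKeeper node to confirm the quorum is formed (assuming the install lives under /team/tools/zookeeper and the four-letter-word commands are enabled):

/team/tools/zookeeper/bin/zkServer.sh status
echo srvr | nc localhost 2181 | grep Mode   # expect one leader and two followers across the three nodes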

Chitra

Try bouncing your ZooKeeper cluster and the Druid master services.

Rommel Garcia

The Coordinator needs a “lot” of heap to manage its segments and run its balancing strategies. That need grows with the number of nodes and segments in the cluster it is coordinating. You have a pretty large direct memory limit compared to everything else; if that 48 GB machine is ONLY hosting a Coordinator, why not bump up the heap + direct memory to somewhere near that limit (minus enough for the OS and maintenance tasks to do their jobs)?
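
As a rough sketch only (hypothetical numbers, assuming a 48 GB host dedicated to the Coordinator), the jvm.config could be pushed to something like:

-server
-Xms30g
-Xmx30g
-XX:MaxDirectMemorySize=10g
-Duser.timezone=UTC
-Dfile.encoding=UTF-8

The exact split matters less than leaving a healthy margin (here roughly 8 GB) for the OS and its page cache.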

Hi Charles and Rommel,

Thank you for your response.

I have the Coordinator and Overlord spun up on 2 nodes, and thus need to adjust the memory accordingly.

VM1: Coordinator and Overlord process

48 GB RAM and 8 CPU cores

druid 26191 1 99 Jun10 ? 8-02:07:14 /team/tools/java/jdk8/bin/java -server -Xms18g -Xmx18g -XX:MaxDirectMemorySize=4g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/data/druid/tmp -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -cp /team/tools/druid/conf/_common:/team/tools/druid/conf/coordinator:/team/tools/druid/current/lib/* io.druid.cli.Main server coordinator


druid 63440 1 0 Jun07 ? 00:14:09 /team/tools/java/jdk8/bin/java -server -Xms2g -Xmx6g -XX:MaxDirectMemorySize=4g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/data/druid/tmp -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -cp /teaam/tools/druid/conf/_common:/team/tools/druid/conf/overlord:/team/tools/druid/current/lib/* io.druid.cli.Main server overlord

VM2: Coordinator and Overlord process

48 GB RAM and 8 CPU cores

druid 6632 1 99 Jun10 ? 2-22:08:06 /team/tools/java/jdk8/bin/java -server -Xms16g -Xmx16g -XX:MaxDirectMemorySize=4g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/data/druid/tmp -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Ddruid.extensions.directory=/team/tools/druid/current/extensions -classpath /team/tools/druid/conf/_common:/team/tools/druid/conf/coordinator:/team/tools/druid/current/lib/* io.druid.cli.Main server coordinator


druid 64337 1 0 Jun07 ? 00:27:18 /team/tools/java/jdk8/bin/java -server -Xms2g -Xmx6g -XX:MaxDirectMemorySize=4g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/data/druid/tmp -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Ddruid.extensions.directory=/team/tools/druid/current/extensions -classpath /team/tools/druid/conf/_common:/team/tools/druid/conf/overlord:/team/tools/druid/current/lib/* io.druid.cli.Main server overlord
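
At their maximum committed sizes that works out to roughly: Coordinator 18 GB heap + 4 GB direct plus Overlord 6 GB heap + 4 GB direct, i.e. about 32 GB of JVM memory on VM1 (about 30 GB on VM2 with its 16 GB Coordinator heap), leaving roughly 16-18 GB of the 48 GB for the OS, page cache, and JVM overhead (approximate figures taken from the flags above).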

To understand and debug further, I stopped real-time ingestion through the supervisors.

I still see the errors below persisting in the Coordinator logs:

  • org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /druid/segments/.…__default_tier_2019-06-12T12:56:48.615Z_89df7a7e2fd14f3eb3b87d3f26cef83e989

  • historical__default_tier_2019-06-12T12:56:47.305Z_6b04f77a58a54853856f6428737595aa126

  • org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /druid/segments/_historical__default_tier_2019-06-12T12:56:47.957Z_0c48a72943814f7e8e21501437db004a553

Before I start bouncing ZooKeeper and the master servers, should I stop and clean up anything on the masters or the ZooKeeper ensemble, for example under these znodes?

/tasks

/loadQueue

/segments

I am not sure, but I wonder whether there is a clean way to understand why we end up with the KeeperErrorCode = NoNode exception.
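
For example, I could first just inspect what is under those paths with the ZooKeeper CLI before removing anything (assuming ZooKeeper is installed under /team/tools/zookeeper and /druid is our base znode, which matches the paths in the errors above):

/team/tools/zookeeper/bin/zkCli.sh -server hostname1:2181
# then, at the zkCli prompt:
ls /druid
ls /druid/segments
ls /druid/loadQueue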

I appreciate your responses here.

Chitra