Zookeeper java.lang.OutOfMemoryError: Java heap space

Hi, I'm trying to migrate from a single server to a cluster.
I followed https://druid.apache.org/docs/latest/tutorials/cluster.html
with the same hardware as described there.
I'm trying to start the master with ZooKeeper on the same node.
I have approximately 193,000 segments.
In the middle of loading the segments from S3 deep storage, I get an error from ZooKeeper and the loading stops.

2020-09-06T09:17:13,337 ERROR [SyncThread:0] org.apache.zookeeper.server.ZooKeeperCriticalThread - Severe unrecoverable error, from thread : SyncThread:0
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236) ~[?:1.8.0_265]
at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191) ~[?:1.8.0_265]
at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1127) ~[zookeeper-3.4.14.jar:3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf]
at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:419) ~[zookeeper-3.4.14.jar:3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf]
at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:169) [zookeeper-3.4.14.jar:3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf]
2020-09-06T09:17:13,341 INFO [SyncThread:0] org.apache.zookeeper.server.ZooKeeperServerListenerImpl - Thread SyncThread:0 exits, error code 1

This is the jvm.config for ZooKeeper:

-server
-Xms128m
-Xmx128m
-Duser.timezone=UTC

What should I configure for ZooKeeper in terms of memory?
Thank you.

I forgot to mention that this is Druid 0.18.1.

What's the cluster configuration, including the ZooKeeper node?

I added the configuration in the attached file.

Hardware (AWS):
master + zookeeper - m5.2xlarge
query - m5.2xlarge
data - i3.4xlarge

Thank you for your help!

druid.zip (12 KB)

Did you get 128M from a template or recommendation? It seems pretty small. I notice that in one distro I downloaded, under conf/zk/jvm.config, it has Xms=Xmx=2G.
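
For illustration, a minimal sketch of what conf/zk/jvm.config could look like with that 2G heap, assuming the other flags stay exactly as in the original post. Whether 2G is enough for roughly 193,000 segments is not something confirmed in this thread, so treat the value as a starting point to test:

```
-server
-Xms2g
-Xmx2g
-Duser.timezone=UTC
```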

I would suggest separating the master and ZooKeeper nodes, because the Coordinator and Overlord already share the master node. Also, if the master node restarts or changes instance type, ZooKeeper has to be outside it to recognize the outage and later plug the new master into the cluster.

After separating the nodes, increase the JVM heap size in jvm.config in both files.
An m3.2xlarge has 30 GB of RAM. You can bump the heap up to about 80% of that, i.e. 24 GB, in jvm.config.

Just a side note: make sure you have enough direct memory for your services (especially Historicals, MiddleManagers and Brokers) before bumping up the min/max heap.

Also, I would use G1GC and see if that makes any difference. As a side note, bumping up the heap will only postpone the problem if there is actually a memory leak.
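
As a rough sketch of the G1GC suggestion, the collector is enabled with the standard HotSpot flag -XX:+UseG1GC. The heap values below are an assumption taken from the 2G figure mentioned earlier in the thread, not a tested recommendation, and which service's jvm.config this belongs in depends on where the memory pressure actually is:

```
-server
-Xms2g
-Xmx2g
-XX:+UseG1GC
-Duser.timezone=UTC
```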

Thank you everyone for the help. To answer all the questions above:
Ben, I got it from the Druid zip file. Maybe the instructions on this page should be updated with the hardware and memory configuration for ZooKeeper:

https://druid.apache.org/docs/latest/tutorials/cluster.html

Jay, I will try 2 GB of memory next week and report back on how the change goes.
If I split the master and ZooKeeper, do you think an m5.large (2 vCPU, 8 GB RAM) will suffice to run ZooKeeper, or do I need a larger or smaller instance for it?
I'm already planning to use an m5.2xlarge for the master, which has 32 GB of RAM. The Druid docs recommend 15 GB in jvm.config for the instance I'm going to use. Is it overkill to increase the memory for the master to 24 GB?

Karthik, thanks for the side note, I'll take it into consideration. All the services will be split across multiple servers with enough memory. The one I wasn't sure about was ZooKeeper: which configuration it should have and which hardware it should run on.

I believe 2 vCPUs could be on the small side for ZK.
I would suggest running the ingestion with the master node on m5.2xlarge and checking free memory with 'free -g'. From an ingestion perspective, it depends on how many peon tasks are communicating with the Overlord, based on the data volume being ingested. If the non-JVM memory is unused, you can increase the JVM heap.

/me enters the room with a cup of tea and throws something that may or may not be helpful onto the table then walks away silently.

https://support.imply.io/hc/en-us/articles/360015465773-Zookeeper-Best-Practices-for-Imply

:smiley: