We have a Druid cluster with 10 historical nodes, 1 coordinator/overlord, 2 MiddleManagers and 2 brokers, backed by a 3-node ZooKeeper cluster.
Overall space availability is 200 TB. The cluster looks stable for 3-4 days, after which the historical nodes become invisible from the coordinator console one by one. But when we check the historical nodes themselves, they seem to be up and running. Once we clean up the index cache and restart the historical nodes they become visible on the coordinator console again, but the problem persists.
Also, there are no space constraints or error spikes on our ZooKeeper cluster; it looks stable.
Please advise us on how to maintain cluster stability.
Also, whenever we restart the historical nodes we clean up the index cache, which comes to around a few TB, so after restarting the segments are downloaded afresh.
Check how many files are in your segment cache. Linux has a limit on the number of memory-mapped areas a single process can have open (vm.max_map_count, roughly 65k by default). When I hit that limit my historical nodes would come up, load and announce all the segments, and then die with an OOM error. The supervise script would then restart them. They look to be up, and they are logging, but they never serve any requests. If you have 200 TB of data on those nodes then it is possible that you are hitting the limit.
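Something like the following can give a quick read on that, assuming the segment cache lives under /var/druid/segment-cache (substitute whatever druid.segmentCache.locations points to on your historicals) and you know the historical's PID:

find /var/druid/segment-cache -type f | wc -l     # files sitting in the segment cache
sysctl vm.max_map_count                           # current per-process mmap limit
sudo wc -l /proc/<historical-pid>/maps            # mappings currently held by the historical JVM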
Thanks for the reply. We did face that limit issue initially, and after setting the limit below we don’t see any OOM issues:
sudo sysctl -w vm.max_map_count=262144
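Note that a value set with sysctl -w only lasts until the next reboot; to make it persistent you would also add the setting to /etc/sysctl.conf (or a file under /etc/sysctl.d/) and reload, for example:

echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf   # persist across reboots
sudo sysctl -p                                                  # apply the setting now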
We are kind of stuck on this issue. Any help on this would be much appreciated.
Do the nodes show up in the coordinator console again if you restart them without cleaning up the segment cache?
Also, do you see any exceptions in the historical logs?
The historicals become visible in the coordinator only after cleaning up the cache and restarting; on a few other nodes they are not visible even after the cleanup and restart.
Our average segment size is 32 MB
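As a rough back-of-the-envelope check (assuming a large share of the 200 TB is actually loaded onto the 10 historicals, and at least one mapping per segment):

echo $(( 200 * 1024 * 1024 / 32 ))        # ~6.5M segments in total at 32 MB each
echo $(( 200 * 1024 * 1024 / 32 / 10 ))   # ~655k segments per historical

If those numbers are anywhere near reality, ~655k mappings per node would still be well above the 262144 you set for vm.max_map_count.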