Had to clean up the segment cache often to get the historicals up

I have a cluster with 6 data nodes, i.e. Historicals and MiddleManagers running on the same boxes. Every time there is maintenance I have to bring the cluster back up, and my Historicals keep running for hours but never come up; the segment loading process runs in loops and never finishes. So I clean up the segment cache manually on all the Historicals and load everything from deep storage, but this has become a regular activity. Is this how everybody loads their segments and gets their Historicals up after maintenance? I want to know if there is something I am missing or another approach I should follow.

Are you seeing OOMs? May we please see the logs?

Thanks for the reply. Below is a snippet from the log:
2022-10-18T18:24:48,551 ERROR [CommitProcWorkThread-6] org.apache.zookeeper.server.NIOServerCnxnFactory - Thread Thread[CommitProcWorkThread-6,5,main] died
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236) ~[?:1.8.0_262]
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118) ~[?:1.8.0_262]
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) ~[?:1.8.0_262]
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) ~[?:1.8.0_262]
at java.io.DataOutputStream.write(DataOutputStream.java:107) ~[?:1.8.0_262]
at org.apache.jute.BinaryOutputArchive.writeString(BinaryOutputArchive.java:109) ~[zookeeper-jute-3.5.9.jar:3.5.9]
at org.apache.zookeeper.proto.GetChildren2Response.serialize(GetChildren2Response.java:56) ~[zookeeper-jute-3.5.9.jar:3.5.9]
at org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:123) ~[zookeeper-jute-3.5.9.jar:3.5.9]
at org.apache.zookeeper.server.ServerCnxn.sendResponse(ServerCnxn.java:79) ~[zookeeper-3.5.9.jar:3.5.9]
at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:690) ~[zookeeper-3.5.9.jar:3.5.9]
at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:492) ~[zookeeper-3.5.9.jar:3.5.9]
at org.apache.zookeeper.server.quorum.Leader$ToBeAppliedRequestProcessor.processRequest(Leader.java:952) ~[zookeeper-3.5.9.jar:3.5.9]
at org.apache.zookeeper.server.quorum.CommitProcessor$CommitWorkRequest.doWork(CommitProcessor.java:298) ~[zookeeper-3.5.9.jar:3.5.9]
at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:155) ~[zookeeper-3.5.9.jar:3.5.9]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_262]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_262]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]
2022-10-18T18:24:48,616 WARN [NIOWorkerThread-110] org.apache.zookeeper.server.NIOServerCnxn - Unable to read additional data from client sessionid 0x200300810240005, likely client has closed socket

Hi @vishalth,
What log did that stack trace come from?
What’s in the historical log?
What’s in the coordinator log?
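Also worth noting: that OutOfMemoryError is being thrown inside the ZooKeeper server process itself, not inside a Druid service, so it's worth checking how much heap the ZK server is started with. As a rough sketch, assuming a standalone ZooKeeper started with zkServer.sh (the sizes below are only example values, tune them for your cluster):

  # conf/java.env -- sourced by zkEnv.sh when zkServer.sh launches the server
  # -Xmx2g is only an illustrative value
  export SERVER_JVMFLAGS="-Xms512m -Xmx2g"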

What kind of maintenance are you doing? Normally for maintenance you follow the process upgrade order described here: Rolling updates · Apache Druid

An individual Historical restarting after maintenance should:

  • evaluate its existing segment cache by reading the contents of its local filesystem, and announce those segments through ZooKeeper (see the example config after this list)
  • the Coordinator will then decide whether to drop or load segments on that Historical, based on what it already has and any rebalancing of segments that might be needed.
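For reference, the local segment cache the Historical re-scans on startup lives wherever druid.segmentCache.locations points in the Historical's runtime.properties. A minimal sketch, where the path and sizes are example values rather than recommendations:

  # Historical runtime.properties (example values only)
  # directory the Historical scans on startup and announces through ZooKeeper
  druid.segmentCache.locations=[{"path":"/var/druid/segment-cache","maxSize":300000000000}]
  # total size of segments this Historical may serve, in bytes
  druid.server.maxSize=300000000000

On startup the Historical walks those locations and announces whatever valid segments it finds (the first bullet above); the Coordinator then only assigns loads for whatever is missing or needs rebalancing.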

Hi Sergio, thanks for the reply. Actually the issue started when I began seeing "ERROR [main-EventThread] org.apache.curator.ConnectionState - Authentication failed" in the coordinator, and when I checked the logs I saw this message in the ZK logs:

Running [ZooKeeper], logging to [/druid/apache-druid-0.23.0/bin/…/log/zookeeper.log] if no changes made to log4j2.xml
Removing file: Oct 31, 2022 6:40:15 AM XX/XX/version-2/snapshot.50005ed99
2022-10-31 09:57:03,550 NIOWorkerThread-94 ERROR An exception occurred processing Appender FileAppender org.apache.logging.log4j.core.appender.AppenderLoggingException: java.lang.OutOfMemoryError: Java heap space
at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:165)
at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:134)
at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:125)
at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:89)
at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:542)
at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:500)
at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:483)
at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:417)
at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:82)
at org.apache.logging.log4j.core.Logger.log(Logger.java:161)
at org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2205)
at org.apache.logging.log4j.spi.AbstractLogger.logMessageTrackRecursion(AbstractLogger.java:2159)
at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2142)
at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2017)
at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1983)
at org.apache.logging.slf4j.Log4jLogger.error(Log4jLogger.java:319)
at org.apache.zookeeper.server.NIOServerCnxnFactory$1.uncaughtException(NIOServerCnxnFactory.java:92)
at java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1057)
at java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1052)
at java.lang.Thread.dispatchUncaughtException(Thread.java:1959)
Caused by: java.lang.OutOfMemoryError: Java heap space

Today the system went down again and I see the same errors in the coordinator and ZK logs. Any idea what this error means and what is causing it?

Which version of ZooKeeper are you using? I just noticed this in the ZooKeeper doc:

Note: Starting with Apache Druid 0.22.0, support for ZooKeeper 3.4.x has been removed
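If you're not sure which version is actually running, one quick way to check (assuming the server listens on the default client port 2181, and using the srvr four-letter command, which is whitelisted by default in 3.5.x) is:

  # ask the running ZooKeeper server for its version
  echo srvr | nc localhost 2181
  # first line of the output looks like: Zookeeper version: 3.5.x-...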