Since upgrading from Druid 0.19.0 to 0.22.1 a couple of months ago we have been struggling with Broker performance issues. Before the upgrade we never needed to restart the Brokers, and they ran with low CPU and GC values. The Brokers were running with 8 CPU, 32GB RAM, a 24G heap, and mostly default settings from the Basic Cluster Tuning guide (7 processing threads, 3 merge buffers, 500MB buffer size).
After the upgrade the Brokers started running out of memory and crashing within 12 hours. We figured that the new version just needed more resources, so we began the exercise of scaling up the Brokers. We are now running them with 16 CPU, 128GB RAM, an 80G heap, 40G of max direct memory, 15 processing threads, 16 merge buffers, and a 500MB buffer size (we tried going to 1GB buffers with poor results). We have gone even higher than this, but the Brokers still consistently exhaust their resources.
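For reference, if we are reading the Basic Cluster Tuning guide correctly, the minimum direct memory these processing settings require is roughly:
druid.processing.buffer.sizeBytes * (druid.processing.numMergeBuffers + druid.processing.numThreads + 1)
  = 500,000,000 * (16 + 15 + 1)
  = 16,000,000,000 bytes (~16 GB)
so the 40G MaxDirectMemorySize should leave plenty of headroom on the direct-memory side.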
Here are our current Broker runtime.properties and JVM settings.
Broker Config
runtime.properties
druid.service=druid/broker
druid.plaintextPort=8082
druid.tlsPort=8282
# HTTP server settings
druid.server.http.numThreads=40
druid.server.http.gracefulShutdownTimeout=PT1M
druid.server.http.defaultQueryTimeout=60000
# HTTP client settings
druid.broker.http.numConnections=30
druid.broker.http.maxQueuedBytes=10000000
# Processing threads and buffers
druid.processing.buffer.sizeBytes=500000000
druid.processing.numMergeBuffers=16
druid.processing.numThreads=15
# Query protection settings
druid.query.scheduler.prioritization.strategy=threshold
druid.query.scheduler.prioritization.durationThreshold=PT24H
druid.query.scheduler.prioritization.adjustment=-1
druid.sql.planner.requireTimeCondition=false
druid.sql.planner.metadataSegmentCacheEnable=true
druid.sql.planner.metadataSegmentPollPeriod=60000
# Query cache disabled -- push down caching and merging to historicals and middlemanagers
druid.broker.cache.useCache=false
druid.broker.cache.populateCache=false
# Monitors that emit self-monitoring metrics
druid.monitoring.monitors=["org.apache.druid.java.util.metrics.SysMonitor","org.apache.druid.java.util.metrics.JvmMonitor","org.apache.druid.java.util.metrics.JvmCpuMonitor","org.apache.druid.java.util.metrics.JvmThreadsMonitor","org.apache.druid.client.cache.CacheMonitor","org.apache.druid.server.metrics.QueryCountStatsMonitor"]
JVM:
java \
  -Xms12G -Xmx12G \
  -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
  -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager \
  -Dorg.jboss.logging.provider=slf4j \
  -Dnet.spy.log.LoggerImpl=net.spy.memcached.compat.log.SLF4JLogger \
  -Dlog4j.shutdownCallbackRegistry=org.apache.druid.common.config.Log4jShutdown \
  -Dlog4j.shutdownHookEnabled=true \
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps \
  -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime \
  -Xloggc:/var/log/druid.gc.log \
  -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=50 -XX:GCLogFileSize=10m \
  -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError \
  -XX:ParallelGCThreads=13 -XX:ConcGCThreads=8 -XX:+UseG1GC -XX:-UseBiasedLocking \
  -Xms80G -Xmx80G -XX:MaxDirectMemorySize=40G \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.authenticate=true \
  -Dcom.sun.management.jmxremote.password.file=/opt/druid/conf/druid/_common/jmxremote.password \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.port=1099 \
  -Dcom.sun.management.jmxremote.rmi.port=1099 \
  -Djava.rmi.server.hostname=10.10.203.27 \
  -cp /opt/druid/conf/druid/_common:/opt/druid/conf/druid/broker:/opt/druid/lib/* \
  org.apache.druid.cli.Main server broker
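One thing to note: -Xms/-Xmx appear twice in this command line (presumably the 12G pair is left over from the base jvm.config and the 80G pair is our override). The JVM takes the last occurrence of a duplicated flag, so the effective heap is 80G, which can be confirmed with something like:
jcmd <broker-pid> VM.flags | tr ' ' '\n' | grep -E 'MaxHeapSize|MaxDirectMemorySize'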
We have observed that the CPU and GC values both increase at the same time. We keep seeing sudden increases until the Broker API becomes unresponsive, the Kubernetes liveness probe fails, and the pod is restarted. In the 1-3 hours before a crash we also start seeing “org.apache.druid.query.QueryCapacityExceededException” errors because the Brokers cannot acquire a merge buffer. The Brokers come back online with low CPU and GC, and work fine for 1-4 days before the issue happens again.
This graph over the past few days shows the most recent example of this behavior.
This is happening on all three of our Druid clusters: two of them are smaller clusters (1k queries per minute, 3 Brokers each) and one is a medium-large cluster (4k queries per minute, 8 Brokers).
To get a better understanding of what is going on, we configured JMX on the Druid JVM and used VisualVM to analyze the heap and threads.
Observing the thread allocations let us see which qtp threads SQL queries were running on. While an SQL query is being handled, the thread name changes from ‘qtp1266602046-###’ to ‘sql[SQL-UUID]’. We can look up those query IDs in our logs and see the corresponding SQL statement. Once the SQL query completes, the thread name changes back to the qtp ID. The following GIF shows this typical behavior.
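For anyone who wants to check this without VisualVM, roughly the same information can be pulled from a plain thread dump; the PID, query ID, and log path below are only placeholders for our setup:
# List the threads currently running SQL queries -- the UUID in the thread name is the sqlQueryId
jstack <broker-pid> | grep '"sql\['
# Look up the corresponding SQL statement in the Broker log by that ID
grep '<sqlQueryId>' /var/log/druid/broker.log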
While watching these threads during a CPU/GC increase, we saw that one of the qtp threads had changed its name to an SQL ID and then never changed back. The thread seemed to get stuck pulling in data from somewhere, and its allocated space just kept increasing. As this happened the heap grew faster, leading to more GC, which required more CPU. This GIF shows the behavior; the four qtp threads at the top are the ones that are stuck.
We determined the ID of one of the stuck qtp threads and saw that it stopped producing any logs the moment it hung. The SQL ID in the thread name could not be found in the logs, so we couldn’t see what the query was. The thread remains in this hung state, producing no logs, until the Broker is restarted.
We took a thread dump and were able to find the hung threads and the functions they were executing at the time. We were hoping to find a common issue, but instead found that the four queries in this example were stuck in three different functions. I will put the thread dump in a comment since it is too long to include here.
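For anyone following along, a dump like this can be captured and filtered with standard JDK tooling; the PID and file name here are placeholders:
# Capture a full thread dump from the Broker JVM
jcmd <broker-pid> Thread.print > /tmp/broker-threads.txt
# Print each hung SQL thread's header plus the top of its stack
grep -A 40 '"sql\[' /tmp/broker-threads.txt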
We are wondering whether this is evidence of a Druid bug, a misconfiguration of our Brokers, or an example of very bad queries hanging up Druid. Since this wasn’t happening in Druid 0.19, we are currently labeling this as a bug.
Thank you for any help.