Sizing issues, Java Garbage Collection stalling Historical process

I’ve already asked this question on the Slack channel, but I figured the mailing list might be a better place for it. The short version: queries using groupBy are essentially broken for us.

We’re currently running Druid 0.16.1 on a single node (16 cores, 256 GB RAM, SSD storage) in a dockerized deployment, using the docker-compose.yml from the GitHub repo.

As our data grew, we’re now hitting a situation where running certain queries stalls the Historical process due to full Java garbage collections. GC stops the application threads for 30+ seconds, so the ZooKeeper session times out, the node is marked as offline, and the queries fail. This also leads to the Historical re-publishing all of its segments, incurring additional overhead. Sometimes the process starts thrashing as well, doing multiple full GCs in a row, each taking dozens of seconds (see logs below).
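
(Side note: the 30 seconds visible in the ZooKeeper logs below is the default session timeout. As far as I know it is controlled by druid.zk.service.sessionTimeoutMs in the common runtime properties, i.e.

  druid.zk.service.sessionTimeoutMs=30000

so any GC pause longer than that expires the session. I’d rather fix the pauses than simply raise this value.)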

I’d appreciate any input. Is it possible to tune Druid in such a way that I can run those queries again without killing the Historical? And if not, what exactly is the culprit?

I’ve implemented most of what can be found on this page[1]:

  • -XX:+PerfDisableSharedMem
  • -XX:-UseBiasedLocking
  • GC logging to tmpfs
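
Combined with the heap and direct-memory settings further down, the effective JVM arguments for the Historical should look roughly like the sketch below (the tmpfs path and the exact remaining flags are placeholders, not copied from the deployment; the GC-logging flags are the usual Java 8 ones that match the log output further down):

  -server
  -Xms64g
  -Xmx64g
  -XX:MaxDirectMemorySize=64g
  -XX:+PerfDisableSharedMem
  -XX:-UseBiasedLocking
  -verbose:gc
  -XX:+PrintGCDetails
  -XX:+PrintGCDateStamps
  -XX:+PrintGCApplicationStoppedTime
  -XX:+PrintGCApplicationConcurrentTime
  -Xloggc:/dev/shm/historical-gc.log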

Important config options:

  • druid_processing_buffer_sizeBytes=1000000000
  • druid_indexer_fork_property_druid_processing_buffer_sizeBytes=1000000000
  • druid_query_groupBy_maxMergingDictionarySize=1000000000
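
These are passed as docker environment variables; as far as I understand, the image’s entrypoint turns every druid_... variable into the corresponding property by replacing underscores with dots, so the above is equivalent to the following runtime properties:

  druid.processing.buffer.sizeBytes=1000000000
  druid.indexer.fork.property.druid.processing.buffer.sizeBytes=1000000000
  druid.query.groupBy.maxMergingDictionarySize=1000000000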

Historical settings:

  • druid_processing_numThreads=36
  • DRUID_XMS=64g
  • DRUID_XMX=64g
  • DRUID_MAXDIRECTMEMORYSIZE=64g
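
For what it’s worth, applying the direct-memory sizing rule from the Druid docs, (druid.processing.numThreads + druid.processing.numMergeBuffers + 1) * druid.processing.buffer.sizeBytes, and assuming the default druid.processing.numMergeBuffers = max(2, numThreads / 4) = 9, gives:

  (36 + 9 + 1) * 1,000,000,000 bytes = 46 GB, well under MaxDirectMemorySize = 64g

so direct memory shouldn’t be the bottleneck; the pauses happen on the 64 GB heap, where the old generation fills up almost completely (see the GC logs below).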

[1] https://druid.apache.org/docs/latest/operations/recommendations.html

Historical Logs:
historical | 2020-01-20T11:19:15,508 WARN [main-SendThread(zookeeper:2181)] org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 50870ms for sessionid 0x1001d1612430009
historical | 2020-01-20T11:19:15,509 INFO [main-SendThread(zookeeper:2181)] org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 50870ms for sessionid 0x1001d1612430009, closing socket connection and attempting reconnect
historical | 2020-01-20T11:19:16,604 INFO [main-EventThread] org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
historical | 2020-01-20T11:19:16,612 INFO [ZkCoordinator] org.apache.druid.server.coordination.ZkCoordinator - Ignoring event[PathChildrenCacheEvent{type=CONNECTION_SUSPENDED, data=null}]
historical | 2020-01-20T11:19:16,751 INFO [NodeTypeWatcher[COORDINATOR]] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeTypeWatcher - Ignored event type [CONNECTION_SUSPENDED] for nodeType watcher.
historical | 2020-01-20T11:19:17,975 INFO [main-SendThread(zookeeper:2181)] org.apache.zookeeper.ClientCnxn - Opening socket connection to server zookeeper/172.19.0.3:2181. Will not attempt to authenticate using SASL (unknown error)
historical | 2020-01-20T11:19:17,976 INFO [main-SendThread(zookeeper:2181)] org.apache.zookeeper.ClientCnxn - Socket connection established to zookeeper/172.19.0.3:2181, initiating session

ZK Logs:
zookeeper | 2020-01-20 11:18:56,291 [myid:1] - INFO [SessionTracker:ZooKeeperServer@398] - Expiring session 0x1001d1612430009, timeout of 30000ms exceeded
zookeeper | 2020-01-20 11:19:17,980 [myid:1] - INFO [NIOWorkerThread-14:ZooKeeperServer@762] - Invalid session 0x1001d1612430009 for client /172.19.0.11:33186, probably expired
zookeeper | 2020-01-20 11:19:47,904 [myid:1] - WARN [NIOWorkerThread-17:NIOServerCnxn@366] - Unable to read additional data from client sessionid 0x1001d161243000b, likely client has closed socket
zookeeper | 2020-01-20 11:20:18,290 [myid:1] - INFO [SessionTracker:ZooKeeperServer@398] - Expiring session 0x1001d161243000b, timeout of 30000ms exceeded

Historical GC logs:
2020-01-20T11:18:32.196+0000: 423.112: [GC (Allocation Failure) [PSYoungGen: 157476K->72008K(171008K)] 66907574K->66891486K(67023872K), 0.1572892 secs] [Times: user=1.19 sys=0.04, real=0.16 secs]
UseAdaptiveSizePolicy actions to meet *** throughput goal ***
GC overhead (%)
Young generation: 95.72 (attempted to grow)
Tenured generation: 4.22 (no change)
Tenuring threshold: (attempted to decrease to avoid survivor space overflow) = 1
2020-01-20T11:18:32.353+0000: 423.269: [Full GC (Ergonomics) [PSYoungGen: 72008K->0K(171008K)] [ParOldGen: 66819478K->60160168K(66852864K)] 66891486K->60160168K(67023872K), [Metaspace: 56302K->56289K(61440K)], 42.4896252 secs] [Times: user=312.25 sys=8.21, real=42.49 secs]
2020-01-20T11:19:14.843+0000: 465.759: Total time for which application threads were stopped: 42.6489335 seconds, Stopping threads took: 0.0008459 seconds
2020-01-20T11:19:14.850+0000: 465.766: Application time: 0.0064550 seconds
UseAdaptiveSizePolicy actions to meet *** throughput goal ***
GC overhead (%)
Young generation: 95.72 (attempted to grow)
Tenured generation: 8.65 (attempted to grow)
Tenuring threshold: (attempted to decrease to avoid survivor space overflow) = 1

Historical GC logs (thrashing):
2020-01-20T12:29:39.423+0000: 2288.114: [Full GC (Ergonomics) [PSYoungGen: 87788K->64400K(171008K)] [ParOldGen: 66852863K->66852668K(66852864K)] 66940652K->66917069K(67023872K), [Metaspace: 62491K->62480K(65536K)], 71.3782513 secs] [Times: user=545.29 sys=5.62, real=71.38 secs]
2020-01-20T12:30:50.801+0000: 2359.493: Total time for which application threads were stopped: 71.3796675 seconds, Stopping threads took: 0.0003425 seconds
2020-01-20T12:30:50.804+0000: 2359.495: Application time: 0.0025729 seconds
2020-01-20T12:30:50.805+0000: 2359.496: Total time for which application threads were stopped: 0.0011259 seconds, Stopping threads took: 0.0004307 seconds
2020-01-20T12:30:50.812+0000: 2359.503: Application time: 0.0067101 seconds
UseAdaptiveSizePolicy actions to meet *** throughput goal ***
GC overhead (%)
Young generation: 80.34 (attempted to grow)
Tenured generation: 66.57 (attempted to grow)
Tenuring threshold: (attempted to decrease to avoid survivor space overflow) = 1
2020-01-20T12:30:50.812+0000: 2359.504: [Full GC (Ergonomics) [PSYoungGen: 86016K->80000K(171008K)] [ParOldGen: 66852863K->66852803K(66852864K)] 66938879K->66932803K(67023872K), [Metaspace: 62482K->62482K(65536K)], 66.9859847 secs] [Times: user=512.08 sys=5.44, real=66.99 secs]
2020-01-20T12:31:57.799+0000: 2426.490: Total time for which application threads were stopped: 66.9868418 seconds, Stopping threads took: 0.0000715 seconds
2020-01-20T12:31:57.809+0000: 2426.500: Application time: 0.0101558 seconds
UseAdaptiveSizePolicy actions to meet *** throughput goal ***
GC overhead (%)
Young generation: 80.34 (attempted to grow)
Tenured generation: 74.92 (attempted to grow)
Tenuring threshold: (attempted to decrease to avoid survivor space overflow) = 1
2020-01-20T12:31:57.810+0000: 2426.501: [Full GC (Ergonomics) [PSYoungGen: 86016K->83065K(171008K)] [ParOldGen: 66852807K->66852829K(66852864K)] 66938823K->66935895K(67023872K), [Metaspace: 62485K->62485K(65536K)], 58.3664439 secs] [Times: user=452.05 sys=4.30, real=58.36 secs]
2020-01-20T12:32:56.177+0000: 2484.868: Total time for which application threads were stopped: 58.3678305 seconds, Stopping threads took: 0.0003242 seconds
2020-01-20T12:32:56.177+0000: 2484.869: Application time: 0.0006139 seconds

Can you send an EXPLAIN of the query you are running that triggers this error? Does bringing back a different result set change it at all? Are there any errors in the brokers?

Hi Felix,
Can you paste your broker jvm.config, broker runtime.properties, historical jvm.config, and historical runtime.properties files (4 files in total)?

Thank you,

–siva