Druid GC Overhead Error

Hi all,

We recently hit a "GC overhead limit exceeded" failure on a Broker node (v0.9.2) that had been running for three weeks. The Broker could neither run queries nor sync lookups. Is this issue fixed in 0.10?

Query Error #1

2017-09-25T07:48:08,916 ERROR [qtp1063494931-117[groupBy_GroupByQuery{dataSource='CompositeRevenue', querySegmentSpec=LegacySegmentSpec{intervals=[2017-09-24T04:00:00.000Z/2017-09-26T03:59:59.000Z]}, limitSpec=NoopLimitSpec, dimFilter=((io.druid.query.filter.InDimFilter@ef56a11e)), granularity=PeriodGranularity{period=P5000Y, timeZone=America/Toronto, origin=1506225600000}, dimensions=[ExtractionDimensionSpec{dimension='CampaignID', extractionFn=RegisteredLookupExtractionFn{delegate=null, lookup='MasterCampaignName', retainMissingValue=false, replaceMissingValueWith='null', injective=false, optimize=true}, outputName='MasterCampaignName'}, ExtractionDimensionSpec{dimension='CampaignID', extractionFn=RegisteredLookupExtractionFn{delegate=null, lookup='MasterCampaignId', retainMissingValue=false, replaceMissingValueWith='null', injective=false, optimize=true}, outputName='MasterCampaignID'}, DefaultDimensionSpec{dimension='SiteID', outputName='SiteID'}, ExtractionDimensionSpec{dimension='SiteID', extractionFn=RegisteredLookupExtractionFn{delegate=null, lookup='AcuitySiteDomain', retainMissingValue=false, replaceMissingValueWith='null', injective=false, optimize=true}, outputName='SiteName'}, ExtractionDimensionSpec{dimension='ExchangeID', extractionFn=RegisteredLookupExtractionFn{delegate=null, lookup='ExchangeName', retainMissingValue=false, replaceMissingValueWith='null', injective=false, optimize=true}, outputName='ExchangeName'}, ExtractionDimensionSpec{dimension='CountryCode', extractionFn=RegisteredLookupExtractionFn{delegate=null, lookup='CountryName', retainMissingValue=false, replaceMissingValueWith='null', injective=false, optimize=true}, outputName='CountryName'}], aggregatorSpecs=, postAggregatorSpecs=, havingSpec=null}_f00e29e2-29af-4f8c-b333-0939775ca96c]] io.druid.server.QueryResource - Exception handling request: {class=io.druid.server.QueryResource, exceptionType=class java.lang.RuntimeException, exceptionMessage=Writer thread failed, exception=java.lang.RuntimeException: Writer thread failed, query=GroupByQuery{dataSource='GroupByQuery{dataSource='CompositeRevenue', querySegmentSpec=LegacySegmentSpec{intervals=[2017-09-24T04:00:00.000Z/2017-09-26T03:59:59.000Z]}, limitSpec=NoopLimitSpec, dimFilter=((io.druid.query.filter.InDimFilter@ef56a11e)), granularity=PeriodGranularity{period=P5000Y, timeZone=America/Toronto, origin=1506225600000}, dimensions=[ExtractionDimensionSpec{dimension='CampaignID', extractionFn=RegisteredLookupExtractionFn{delegate=null, lookup='MasterCampaignName', retainMissingValue=false, replaceMissingValueWith='null', injective=false, optimize=true}, outputName='MasterCampaignName'}, ExtractionDimensionSpec{dimension='CampaignID', extractionFn=RegisteredLookupExtractionFn{delegate=null, lookup='MasterCampaignId', retainMissingValue=false, replaceMissingValueWith='null', injective=false, optimize=true}, outputName='MasterCampaignID'}, DefaultDimensionSpec{dimension='SiteID', outputName='SiteID'}, ExtractionDimensionSpec{dimension='SiteID', extractionFn=RegisteredLookupExtractionFn{delegate=null, lookup='AcuitySiteDomain', retainMissingValue=false, replaceMissingValueWith='null', injective=false, optimize=true}, outputName='SiteName'}, ExtractionDimensionSpec{dimension='ExchangeID', extractionFn=RegisteredLookupExtractionFn{delegate=null, lookup='ExchangeName', retainMissingValue=false, replaceMissingValueWith='null', injective=false, optimize=true}, outputName='ExchangeName'}, ExtractionDimensionSpec{dimension='CountryCode', extractionFn=RegisteredLookupExtractionFn{delegate=null, lookup='CountryName', retainMissingValue=false, replaceMissingValueWith='null', injective=false, optimize=true}, outputName='CountryName'}], aggregatorSpecs=, postAggregatorSpecs=, havingSpec=null}', querySegmentSpec=LegacySegmentSpec{intervals=[2017-09-24T04:00:00.000Z/2017-09-26T03:59:59.000Z]}, limitSpec=NoopLimitSpec, dimFilter=null, granularity=PeriodGranularity{period=P5000Y, timeZone=America/Toronto, origin=1506225600000}, dimensions=, aggregatorSpecs=[CountAggregatorFactory{name='TotalRows'}], postAggregatorSpecs=, havingSpec=null}, peer=10.65.16.132}

Query Error #2

2017-09-24T19:25:25,857 ERROR [qtp1063494931-111[groupBy_CompositeRevenue_487fe2e0-404f-4d28-8880-2cc7b2c5b19e]] io.druid.server.QueryResource - Exception handling request: {class=io.druid.server.QueryResource, exceptionType=class java.lang.RuntimeException, exceptionMessage=com.fasterxml.jackson.databind.JsonMappingException: org.jboss.netty.handler.codec.embedder.CodecEmbedderException: java.lang.OutOfMemoryError: GC overhead limit exceeded (through reference chain: java.util.ArrayList[19836]), exception=java.lang.RuntimeException: com.fasterxml.jackson.databind.JsonMappingException: org.jboss.netty.handler.codec.embedder.CodecEmbedderException: java.lang.OutOfMemoryError: GC overhead limit exceeded (through reference chain: java.util.ArrayList[19836]), query=GroupByQuery{dataSource='CompositeRevenue', querySegmentSpec=LegacySegmentSpec{intervals=[2017-09-24T13:25:00.020Z/2017-09-24T15:25:00.020Z]}, limitSpec=NoopLimitSpec, dimFilter=null, granularity=AllGranularity, dimensions=[DefaultDimensionSpec{dimension='SiteID', outputName='SiteID'}, io.druid.query.dimension.LookupDimensionSpec@ad6f4fd4, DefaultDimensionSpec{dimension='ExchangeID', outputName='ExchangeID'}, io.druid.query.dimension.LookupDimensionSpec@64dce9e], aggregatorSpecs=[LongSumAggregatorFactory{fieldName='Views', name='Views'}, DoubleSumAggregatorFactory{fieldName='DemandAuctionPrice', name='DemandAuctionPrice'}, DoubleSumAggregatorFactory{fieldName='AdvertiserValue', name='AdvertiserValue'}], postAggregatorSpecs=, havingSpec=null}, peer=10.65.17.26}

Running jstat against the broken Broker process showed 100% old-generation utilization and a large amount of time spent in full GC. The ps aux output is included as well.
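
For reference, the column headers below match jstat -gcutil output; a one-shot snapshot along these lines (using the broken Broker's PID from the ps aux output further down) produces them:

jstat -gcutil 10683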

S0     S1     E      O      M      CCS    YGC   YGCT     FGC    FGCT       GCT
0      0      99.66  100    78.79  60.67  2269  289.877  45202  29522.184  29812.061

druid 10683 32.0 12.4 63222892 49415484 ? Sl Sep06 8650:10 java -server -Xms32000m -Xmx32000m -XX:MaxDirectMemorySize=1048576m -Dlog4j.configurationFile=/usr/hdp/current/druid-broker/conf/_common/druid-log4j.xml -Dlog4j.debug -Duser.timezone=UTC -Dfile.encoding=UTF-8 -cp /usr/hdp/current/druid-broker/conf/_common:/usr/hdp/current/druid-broker/conf/broker:/usr/hdp/current/druid-broker/lib/*:/usr/hdp/current/hadoop-client/conf io.druid.cli.Main server broker

These numbers were far higher than on a freshly started Broker node.

S0     S1     E      O      M      CCS    YGC   YGCT   FGC   FGCT   GCT
0      44.77  53.91  8.73   97.62  94.79  25    5.859  3     0.272  6.131

druid 1776 15.9 7.0 63178420 28043024 ? Sl 11:14 13:29 java -server -Xms32000m -Xmx32000m -XX:MaxDirectMemorySize=1048576m -Dlog4j.configurationFile=/usr/hdp/current/druid-broker/conf/_common/druid-log4j.xml -Dlog4j.debug -Duser.timezone=UTC -Dfile.encoding=UTF-8 -cp /usr/hdp/current/druid-broker/conf/_common:/usr/hdp/current/druid-broker/conf/broker:/usr/hdp/current/druid-broker/lib/*:/usr/hdp/current/hadoop-client/conf io.druid.cli.Main server broker
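
If more data points would help, these are a couple of standard JDK commands I can run the next time the Broker gets into this state (10683 is the broken Broker's PID from above; output file names are just illustrative):

jmap -histo:live 10683 > broker-histogram.txt   (heap histogram: which classes are filling the old gen)
jstack 10683 > broker-threads.txt               (thread dump: what the query/writer threads are doing)

Going forward we could also enable GC logging on the Broker JVM with something like -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/druid/broker-gc.log (log path is an assumption) to capture the lead-up to the next failure.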