[Urgent] Queries slow on Broker.

I need some quick thoughts on troubleshooting our broker node in Production.

We use Druid v0.8.3 in production. Everything looked fine until 2 days ago, when the broker nodes started having slow query issues during our peak hour. Our query volume has increased significantly recently. I looked at CPU and memory on the broker nodes but saw no sign of a problem there. Our segment size is currently 5 minutes with a 5-minute window period. Later in the day, the segments are rewritten as 15-minute segments by a Hadoop batch pipeline.

Some facts and steps for troubleshooting:

  • There were some errors on the middle managers, but they looked benign to me. Attached.

  • Attaching a log showing that the actual query finished very quickly, in about 50 ms.

  • Attaching Druid broker config.

  • At peak time we send more than 50 queries per second.

  • Attaching JVM profile logs.

java params:

/usr/bin/java -server -Xmx6g -Xms6g -XX:NewSize=1500m -XX:MaxNewSize=1500m -XX:MaxDirectMemorySize=16g -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -Djava.io.tmpdir=/tmp -classpath config/_common:config/broker:lib/* -Ddruid.properties.file=config/broker/runtime.properties io.druid.cli.Main server broker

Thanks,

Karan

broker.log (5.21 KB)

brokerConfig.txt (1 KB)

jstat-profile.txt (397 KB)

Hey Karan,

Could you look at a thread dump (jstack -l [pid]) of the broker during a time when queries are slow? From what you describe it sounds like you may have a few long-running/slow queries blocking up the broker http thread pool. If you see that all the broker http threads (qtpXXX) are occupied but the historicals are not running at capacity, then you could try:

  • set druid.server.http.numThreads higher on the broker to allow more concurrent queries.

  • track down which queries are slow, then eliminate or optimize them.

Hey Gian,

Thank you for your response. I think slow queries on the historical nodes were the culprit, and increasing druid.server.http.numThreads from 10 to 200 resolved it.

I still have another urgent issue: all realtime queries against open segments on the middle managers are timing out. The segment size is currently 5 minutes with a 5-minute window period, so the realtime tasks end up serving queries for roughly the most recent 10 minutes (present time minus 10 minutes).
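
To make the 10-minute arithmetic concrete, here is a small Python sketch. It rests on two simplifying assumptions that are not confirmed anywhere in this thread: a realtime task for the bucket [s, s + segment) stays queryable until s + segment + window, and handoff to the historicals is instantaneous. The function name `realtime_served_range` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def realtime_served_range(now,
                          segment=timedelta(minutes=5),
                          window=timedelta(minutes=5)):
    """Return (start, end) of the time range still held by realtime tasks.

    Assumption: the task for bucket [s, s + segment) is alive until
    s + segment + window, and handoff takes no time.
    """
    # Start of the segment bucket that contains `now`.
    start = now - ((now - EPOCH) % segment)
    # The bucket ending at `start` is still open while start + window > now,
    # so keep stepping back one bucket until that no longer holds.
    while start + window > now:
        start -= segment
    return start, now
```

For the timestamp of the error below, `realtime_served_range(datetime(2016, 10, 3, 17, 20, 16, tzinfo=timezone.utc))` gives a start of 17:15:00 UTC under these assumptions. The range can never exceed segment + window (10 minutes here), and real handoff delays would stretch it further.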

Error on broker node:

2016-10-03T17:20:16,146 ERROR [qtp1232948374-206] io.druid.server.QueryResource - Exception handling request: {class=io.druid.server.QueryResource, exceptionType=class com.metamx.common.RE, exceptionMessage=Failure getting results from[http://ip-172-31-39-139.us-west-2.compute.internal:8530/druid/v2/] because of [org.jboss.netty.handler.timeout.ReadTimeoutException], exception=com.metamx.common.RE: Failure getting results from[http://ip-172-31-39-139.us-west-2.compute.internal:8530/druid/v2/] because of [org.jboss.netty.handler.timeout.ReadTimeoutException], query=TopNQuery{dataSource='EntityAuth-streaming', dimensionSpec=DefaultDimensionSpec{dimension='video_id', outputName='video_id'}, topNMetricSpec=NumericTopNMetricSpec{metric='count'}, threshold=5, querySegmentSpec=LegacySegmentSpec{intervals=[2016-09-28T17:15:00.403Z/2016-10-03T17:15:00.404Z]}, dimFilter=(org_id = 7554 && new_video = true && !video_id = ), granularity='AllGranularity', aggregatorSpecs=[CountAggregatorFactory{name='count'}], postAggregatorSpecs=}, peer=172.31.255.250}

com.metamx.common.RE: Failure getting results from[http://ip-172-31-39-139.us-west-2.compute.internal:8530/druid/v2/] because of [org.jboss.netty.handler.timeout.ReadTimeoutException]

Certain facts and troubleshooting steps:

  • Middle managers are EC2 instances of type r3.8xlarge.

  • There are no errors reported on middle managers.

  • Attaching middle manager config.

  • Tried increasing the middle manager setting druid.indexer.fork.property.druid.server.http.numThreads to 200, but saw no improvement.

I really appreciate your help on this in advance. Thanks

middlemanager-config.txt (2.11 KB)

Hi Karan, it is a bit difficult for us to dig into this without more information.

Can you post the query metrics (http://druid.io/docs/0.9.1.1/operations/metrics.html) for the realtime segments?
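
If the metrics are awkward to pull together by hand, something along these lines could summarize them. This is only a sketch: it assumes you can get the emitted events as one JSON object per line with `metric`, `service` and `value` fields (the field and metric names follow the metrics page linked above; how you extract the events depends on which emitter you run). The function name `summarize_query_metrics` is made up for illustration.

```python
import json
from collections import defaultdict

# Query metrics of interest, as named on the Druid metrics documentation page.
QUERY_METRICS = {"query/time", "query/wait/time", "query/segment/time"}


def summarize_query_metrics(lines):
    """Summarize query metrics per (service, metric) from JSON-lines events.

    Assumption: each line is a JSON object with "metric", "service" and
    "value" fields. Lines that do not parse or lack these are skipped.
    """
    values = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except ValueError:
            continue  # not a JSON event, ignore
        if not isinstance(event, dict):
            continue
        if event.get("metric") in QUERY_METRICS and "value" in event:
            key = (event.get("service"), event["metric"])
            values[key].append(float(event["value"]))
    return {
        key: {"count": len(v), "mean": sum(v) / len(v), "max": max(v)}
        for key, v in values.items()
    }
```

Comparing query/time against query/segment/time and query/wait/time for the realtime tasks should show whether the time goes into scanning segments or into waiting for a processing thread.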