How to improve query latency?

We have a Druid cluster with 5 brokers and 17 historicals, holding 286 GB of data: about 1 billion records over 7 years, in roughly 8000 shards across 76 intervals.

Our SQL query looks like this:

{
  "query": "select brandId,colType,created,offlineTime,onlineTime,title,venderId,wareId,wareStatus from product43 where venderId='106644' and __time>='2019-03-01 00:00:00' and __time<='2019-04-02 00:00:00' limit 20"
}
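For reference, a query like this can be posted directly to a broker's /druid/v2/sql/ endpoint with plain curl; the one-liner below is only a sketch (it reuses one of the broker addresses and the same JSON body shown above), not the exact tooling from our test:

curl -XPOST -H "Content-Type: application/json" http://10.163.127.215:8082/druid/v2/sql/ -d "{\"query\":\"select brandId,colType,created,offlineTime,onlineTime,title,venderId,wareId,wareStatus from product43 where venderId='106644' and __time>='2019-03-01 00:00:00' and __time<='2019-04-02 00:00:00' limit 20\"}"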

On the client side, we use 8 threads with 30 connections per broker to post requests with the wrk tool (a sketch of the product.lua script is shown after the commands below):

wrk -t9 -c30 -d10s --script=product.lua --latency http://10.163.127.215:8082/druid/v2/sql/ >> product-result &

wrk -t9 -c30 -d10s --script=product.lua --latency http://10.163.120.178:8082/druid/v2/sql/ >> product-result &

wrk -t9 -c30 -d10s --script=product.lua --latency http://10.163.122.95:8082/druid/v2/sql/ >> product-result &

wrk -t9 -c30 -d10s --script=product.lua --latency http://10.163.120.33:8082/druid/v2/sql/ >> product-result &

wrk -t9 -c30 -d10s --script=product.lua --latency http://10.167.31.170:8082/druid/v2/sql/ >> product-result &

wrk -t9 -c30 -d10s --script=product.lua --latency http://10.163.123.118:8082/druid/v2/sql/ >> product-result &
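The product.lua file is a standard wrk request script; a minimal sketch of what it can look like is below (the JSON body is the query shown earlier, the rest is an approximation rather than our exact script):

-- product.lua (sketch): POST the SQL query above as a JSON body on every request
wrk.method = "POST"
wrk.headers["Content-Type"] = "application/json"
wrk.body = '{"query":"select brandId,colType,created,offlineTime,onlineTime,title,venderId,wareId,wareStatus from product43 where venderId=\'106644\' and __time>=\'2019-03-01 00:00:00\' and __time<=\'2019-04-02 00:00:00\' limit 20"}'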

The result we got is:

Running 10s test @ http://10.163.123.118:8082/druid/v2

8 threads and 30 connections

Thread Stats Avg Stdev Max +/- Stdev

Latency    51.99ms   26.35ms 248.99ms   80.17%

Req/Sec    58.57     15.51   150.00     67.06%

Latency Distribution

 50%   44.98ms

 75%   62.01ms

 90%   85.46ms

 99%  150.22ms

4943 requests in 10.07s, 28.10MB read

Requests/sec: 491.09

Transfer/sec: 2.79MB

But sometimes we get better performance:

Running 10s test @ http://10.163.127.215:8082/druid/v2/sql/

8 threads and 30 connections

Thread Stats Avg Stdev Max +/- Stdev

Latency    45.55ms   18.69ms 116.15ms   68.73%

Req/Sec    65.75     16.08   110.00     74.38%

Latency Distribution

 50%   38.34ms

 75%   60.95ms

 90%   73.98ms

 99%   92.86ms

5259 requests in 10.01s, 31.22MB read

Requests/sec: 525.57

Transfer/sec: 3.12MB

The 99th-percentile latency gets worse when we increase the number of connections.

Could you tell us how to solve this problem, or otherwise improve the performance?

The broker settings:

Broker JVM arguments:

-server

-Xms24g

-Xmx24g

-XX:MaxDirectMemorySize=50g

-Duser.timezone=UTC

-Dfile.encoding=UTF-8

-Djava.io.tmpdir=var/tmp

-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager

-XX:NewSize=6g

-XX:MaxNewSize=6g

-XX:+UseConcMarkSweepGC

-XX:+PrintGCDetails

-XX:+PrintGCTimeStamps

Broker runtime properties:

druid.host=10.163.127.215

druid.service=druid/broker

druid.plaintextPort=8082

# HTTP server threads

druid.broker.http.numConnections=20

druid.server.http.numThreads=70

# Processing threads and buffers

druid.processing.buffer.sizeBytes=1073741824

druid.processing.numThreads=35

# Query cache

druid.broker.cache.useCache=true

druid.broker.cache.populateCache=true

druid.cache.type=caffeine

druid.cache.sizeInBytes=2000000000

druid.sql.enable=true

druid.broker.http.readTimeout=PT5M

druid.sql.planner.maxQueryCount=0

druid.broker.cache.useResultLevelCache=true

druid.broker.cache.populateResultLevelCache=true

druid.broker.cache.unCacheable=

The historical settings:

Historical JVM arguments:

-server

-Xms8g

-Xmx8g

-XX:MaxDirectMemorySize=32g

-Duser.timezone=UTC

-Dfile.encoding=UTF-8

-Djava.io.tmpdir=var/tmp

-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager

Historical runtime properties:

druid.host=historical

druid.service=druid/historical

druid.plaintextPort=8083

# HTTP server threads

druid.server.http.numThreads=50

# Processing threads and buffers

druid.processing.buffer.sizeBytes=536870912

druid.processing.numThreads=20

# Segment storage

druid.segmentCache.locations=[{"path":"var/druid/segment-cache","maxSize":130000000000}]

druid.server.maxSize=500000000000

druid.historical.cache.unCacheable=

druid.historical.cache.populateCache=true

druid.historical.cache.useCache=true

druid.segmentCache.locations=[{"path":"/var/druid/segment-cache","maxSize":500000000000}]

Thanks!

This query performance doesn’t seem too bad: 50–200ms is a respectable level of perf for the kinds of applications Druid typically targets. What were you hoping to see? By the way, you should be able to scale to higher levels of concurrent queries at the same perf by adding more servers. And before doing that, you should double check to make sure you’re using 100% of CPU on at least one server type (broker, historical). If not, you could adjust some tunings to use your hardware resources better.

Gian