Configuration Review: Concurrent queries beyond 1000 requests per second

Hi All,

I am trying to configure a Druid cluster for optimum performance under a concurrent load of around 5,000 requests per second.
The queries are mostly groupBy and timeseries queries; an individual query returns a response in the range of 500 ms to 2 s.
But when the request rate goes beyond 1,000 rps, CPU spikes to 90%, performance degrades drastically, and some requests fail.
Please review these config parameters and let me know whether concurrency can be improved by tweaking any of them; currently the cluster works pretty well at a few hundred rps.
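
For a rough sense of the concurrency involved, here is a back-of-envelope sketch (a Python illustration using only the numbers above; the 0.5 s and 2 s latencies are the per-query figures I quoted) of how many queries are in flight at a given request rate, via Little's law:

# Rough in-flight concurrency estimate via Little's law: L = lambda * W.
# Uses the per-query latencies quoted above; purely illustrative.
def in_flight_queries(requests_per_second, avg_latency_seconds):
    return requests_per_second * avg_latency_seconds

for rps in (500, 1000, 5000):
    for latency_s in (0.5, 2.0):
        print(f"{rps} rps @ {latency_s}s -> ~{in_flight_queries(rps, latency_s):.0f} queries in flight")

At 1,000 rps that is already roughly 500 to 2,000 queries in flight across the two query nodes, which is the load level where the degradation starts.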

These are my configurations:
Data (3 nodes: 72 cores, 144 GB RAM, 2 TB gp2 EBS)
Query (2 nodes: 16 cores, 122 GB RAM)

middleManager/runtime.properties:

druid.worker.capacity=31

Task launch parameters

druid.indexer.runner.javaOpts=-server -Xms1g -Xmx1g -XX:MaxDirectMemorySize=1g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+ExitOnOutOfMemoryError -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
druid.indexer.task.baseTaskDir=var/druid/task

HTTP server threads

druid.server.http.numThreads=1100

Processing threads and buffers on Peons

druid.indexer.fork.property.druid.processing.numMergeBuffers=8

druid.indexer.fork.property.druid.processing.buffer.sizeBytes=1000000000

druid.indexer.fork.property.druid.processing.numThreads=3

druid.indexer.runner.javaOptsArray=["-server","-Xmx3g","-XX:MaxDirectMemorySize=15G"]
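
As a sanity check on the Peon sizing above, here is a small sketch applying the usual Druid direct-memory rule of thumb (MaxDirectMemorySize should be at least buffer.sizeBytes * (numThreads + numMergeBuffers + 1)); the values are copied from the fork properties above:

# Peon direct memory needed per the rule of thumb:
# sizeBytes * (numThreads + numMergeBuffers + 1)
buffer_size_bytes = 1_000_000_000   # druid.indexer.fork.property.druid.processing.buffer.sizeBytes
num_threads = 3                     # druid.indexer.fork.property.druid.processing.numThreads
num_merge_buffers = 8               # druid.indexer.fork.property.druid.processing.numMergeBuffers

required_bytes = buffer_size_bytes * (num_threads + num_merge_buffers + 1)
print(f"Required direct memory per Peon: ~{required_bytes / 1e9:.0f} GB")  # ~12 GB, under the 15G in javaOptsArray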

Hadoop indexing

druid.indexer.task.hadoopWorkingPath=var/druid/hadoop-tmp

druid.query.search.maxSearchLimit=1000000000

druid.query.groupBy.maxMergingDictionarySize=100000000
druid.query.groupBy.maxOnDiskStorage=1000000000

historical/runtime.properties

druid.service=druid/historical
druid.plaintextPort=8083

HTTP server threads

druid.server.http.numThreads=1100

Processing threads and buffers

druid.processing.buffer.sizeBytes=500000000
druid.processing.numMergeBuffers=40

druid.processing.numThreads=80

druid.processing.tmpDir=/data/var/druid/processing
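
One thing worth keeping in mind for the groupBy-heavy workload: as I understand groupBy v2, each query holds one merge buffer on the Historical while it is being processed, so numMergeBuffers effectively caps concurrent groupBy queries per Historical (further queries wait for a free buffer). A rough sketch with the values above:

# Approximate cap on concurrent groupBy queries the Historical tier can process
# at once, assuming one merge buffer is held per in-flight groupBy query.
merge_buffers_per_historical = 40   # druid.processing.numMergeBuffers
historical_nodes = 3                # from the hardware list above

concurrent_groupby_cap = merge_buffers_per_historical * historical_nodes
print(f"Concurrent groupBy queries across the tier: ~{concurrent_groupby_cap}")
# ~120: well below the number of queries in flight at 1,000 rps, so excess
# groupBy queries queue on the data nodes.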

Segment storage

druid.segmentCache.locations=[{"path":"/data/var/druid/segment-cache","maxSize":150000000000}]
druid.server.maxSize=1500000000000

Query cache

druid.historical.cache.useCache=true
druid.historical.cache.populateCache=true
druid.cache.type=caffeine
druid.cache.sizeInBytes=512000000

druid.query.search.maxSearchLimit=1000000000
druid.query.groupBy.maxMergingDictionarySize=100000000
druid.query.groupBy.maxOnDiskStorage=1000000000
druid.server.http.maxSubqueryRows=2000000000

broker/runtime.properties

druid.service=druid/broker
druid.plaintextPort=8082

HTTP server settings

druid.server.http.numThreads=120

HTTP client settings

druid.broker.http.numConnections=500
druid.broker.http.maxQueuedBytes=100000000
druid.server.http.defaultQueryTimeout=3600000

Processing threads and buffers

druid.processing.buffer.sizeBytes=500000000
druid.processing.numMergeBuffers=40

druid.processing.numThreads=15
druid.processing.tmpDir=/data/var/druid/processing

Query cache

druid.broker.cache.useCache=true

druid.broker.cache.populateCache=true

druid.sql.planner.sqlTimeZone=IST
druid.sql.planner.maxTopNLimit=1000000000
druid.sql.planner.metadataSegmentCacheEnable=true
druid.broker.cache.useResultLevelCache=true
druid.broker.cache.populateResultLevelCache=true
druid.query.search.maxSearchLimit=1000000000

druid.query.groupBy.maxMergingDictionarySize=100000000
druid.query.groupBy.maxOnDiskStorage=1000000000

Regards,

Kundan

The Druid version is 0.18.1.

I have observed that this happens only while ingestion is running; when ingestion is stopped, the Historicals are able to serve at a very good throughput rate.
What could be the possible cause of this? Any suggestions would be helpful.

Hi, I’m a developer evangelist with Imply and I wanted you to know that I’ve seen your question and am working to get you an answer. Thanks, Matt

Is there locking on the realtime segments during ingestion, so that reads have to wait?

Hey Kundan - ingestion is handled purely by the MiddleManager Peons / the Indexer, and they consume a core per worker. Queries also consume a core, but per segment being read to answer each query (hence it is so important that you check your segment numbers / sizes).

Maybe you are core bound?
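
To illustrate, here is a rough sketch; it assumes the MiddleManager and Historical are colocated on each 72-core data node, which is how I read the hardware list above:

# Rough core budget for one data node, assuming the MiddleManager and
# Historical share the same 72-core box (my reading of the hardware list).
cores_per_data_node = 72

ingestion_cores = 31                # one core per Peon worker (druid.worker.capacity)
query_processing_threads = 80       # druid.processing.numThreads on the Historical

peak_demand = ingestion_cores + query_processing_threads
print(f"Peak core demand: {peak_demand} vs {cores_per_data_node} available")
# 111 vs 72: while ingestion tasks are running the node is oversubscribed,
# which would explain why throughput recovers as soon as ingestion stops.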

Also, maybe check your connection pool…
https://druid.apache.org/docs/latest/operations/basic-cluster-tuning.html#connection-pool
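
My reading of that page is that druid.server.http.numThreads on the Historicals should sit slightly above the sum of druid.broker.http.numConnections across the Brokers; a quick check with your numbers (assuming both query nodes run a Broker):

# Connection pool check, per my reading of the basic-cluster-tuning guide.
brokers = 2                         # assuming one Broker per query node
broker_num_connections = 500        # druid.broker.http.numConnections
historical_http_threads = 1100      # druid.server.http.numThreads on the Historicals

print(f"Sum of Broker connections: {brokers * broker_num_connections}, "
      f"Historical HTTP threads: {historical_http_threads}")
# 1000 vs 1100: the pool looks sized per the guideline, so I would look harder
# at the per-segment core usage described above.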

Thanks, Peter. I was not using enough Peons for ingestion; increasing the task count helped.
Regards,
Kundan

:thumbs-up: :smiley:
