Whatever I do, I am unable to get more than 5 to 10 TPS

I am using 7 data servers, each with 64 cores and 250 GB RAM. Whatever configuration options I try, I am unable to get more than 5 to 10 TPS. Druid hangs if it goes over 10 TPS.

Here are my configs:

Broker

druid.service=druid/broker
druid.plaintextPort=8082

HTTP server settings

druid.server.http.numThreads=120

HTTP client settings

druid.broker.http.numConnections=500
druid.broker.http.maxQueuedBytes=100000000
druid.server.http.defaultQueryTimeout=3600000

Processing threads and buffers

druid.processing.buffer.sizeBytes=500000000
druid.processing.numMergeBuffers=40
druid.processing.numThreads=15
druid.processing.tmpDir=var/druid/processing

Query cache disabled – push down caching and merging instead

druid.broker.cache.useCache=false
druid.broker.cache.populateCache=false
druid.broker.cache.useResultLevelCache=false
druid.broker.cache.populateResultLevelCache=false
druid.sql.planner.metadataSegmentCacheEnable=false
druid.query.search.maxSearchLimit=1000000000
druid.query.groupBy.maxMergingDictionarySize=100000000
druid.query.groupBy.maxOnDiskStorage=1000000000
druid.broker.cache.useResultLevelCache=true
druid.broker.cache.populateResultLevelCache=true

MiddleManager

druid.service=druid/middleManager
druid.plaintextPort=8091

Number of tasks per middleManager

druid.worker.capacity=20

Task launch parameters

druid.indexer.runner.javaOpts=-server -Xms2g -Xmx2g -XX:MaxDirectMemorySize=13g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+ExitOnOutOfMemoryError -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
druid.indexer.task.baseTaskDir=var/druid/task

HTTP server threads

druid.server.http.numThreads=800

Processing threads and buffers on Peons

druid.indexer.fork.property.druid.processing.numMergeBuffers=8
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=1000000000
druid.indexer.fork.property.druid.processing.numThreads=4

Hadoop indexing

druid.indexer.task.hadoopWorkingPath=var/druid/hadoop-tmp

druid.realtime.cache.useCache=true
druid.realtime.cache.populateCache=true
druid.cache.type=caffeine
druid.cache.sizeInBytes=2000000000
druid.query.groupBy.maxMergingDictionarySize=100000000
#druid.query.groupBy.maxOnDiskStorage=5000000000
druid.query.groupBy.defaultStrategy=v2

Historicals

druid.service=druid/historical
druid.plaintextPort=8083

HTTP server threads

druid.server.http.numThreads=800

Processing threads and buffers

druid.processing.buffer.sizeBytes=500000000
druid.processing.numMergeBuffers=40
druid.processing.numThreads=63
druid.processing.tmpDir=var/druid/processing

Segment storage

druid.segmentCache.locations=[{"path":"var/druid/segment-cache","maxSize":"300g"}]

Query cache

druid.historical.cache.useCache=true
druid.historical.cache.populateCache=true
druid.cache.type=caffeine
druid.cache.sizeInBytes=2000000000

druid.query.groupBy.maxMergingDictionarySize=100000000
#druid.query.groupBy.maxOnDiskStorage=1000000000
druid.query.groupBy.defaultStrategy=v2

Can someone help? It's blocking my deployment.

I am using a 1 TB SSD for data on all nodes. The query is very simple and generally takes 200 to 500 ms. The cluster is built on AWS EC2 instances.

Could you elaborate on what you mean by 'TPS', please?

Assuming you are referring to ingestion, nothing jumps out as being too far out of the ordinary in your MM config, but you could increase the worker capacity into the 24-63 range and reduce druid.server.http.numThreads to 100, as sketched below.
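A minimal sketch of those two MiddleManager changes, assuming a worker capacity somewhere in the suggested range (the exact value is an assumption to tune, not a prescription):

# assumed value, anywhere in the suggested 24-63 range
druid.worker.capacity=32
druid.server.http.numThreads=100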

Thanks, Vijeth. Even without Kafka ingestion, the Historicals are not going beyond 50 TPS. Can you please share broker and router configs for a 16-core server? Surprisingly, adding more Historical nodes is not helping. Also, what taskCount is recommended in the Kafka ingestion spec?

Hi rao222, I am still not sure what you mean by TPS. Here are typical settings for a broker on a 16-core node:

"druid.processing.numMergeBuffers": 60,
"druid.processing.numThreads": 1,
"druid.broker.http.numConnections": 25,
"druid.processing.buffer.sizeBytes": 500000000,
"druid.server.http.numThreads": 60

In order for us to give you more guidance, could you provide your Kafka specs/requirements so we can size the supervisor? For example: expected throughput, number of partitions, query/ingestion SLA, etc. (See the sketch below for where taskCount fits in the supervisor spec.)
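For reference, taskCount sits in the ioConfig of the Kafka supervisor spec. The values below are placeholders only; the right numbers depend on the throughput and partition count asked about above:

"ioConfig": {
  "type": "kafka",
  "topic": "your_topic",
  "consumerProperties": { "bootstrap.servers": "your-kafka:9092" },
  "taskCount": 2,
  "replicas": 1,
  "taskDuration": "PT1H"
}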

Hi Vijeth, thanks for the information. By TPS I mean the number of concurrent queries (throughput). It is a single group-by query, called with different company IDs, that returns 30 rows ordered by __time desc. The query takes around 200 ms, but once I go above 20 concurrent queries the same query takes around 20 seconds or more, or it times out. Using more tasks slows things down even further. Also, can you please explain what you mean by sizing the supervisor?
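(For context, a query of that shape would look roughly like the following Druid SQL; the datasource, metric, filter column, and time window are illustrative placeholders, not the actual query from this thread.)

SELECT
  __time,
  SUM(amount) AS total_amount
FROM transactions
WHERE company_id = 'ACME'
  AND __time >= CURRENT_TIMESTAMP - INTERVAL '30' DAY
GROUP BY __time
ORDER BY __time DESC
LIMIT 30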

Ah, I see that you are looking at improving query concurrency. Assuming the Historicals are 64-CPU machines, you really should be seeing better performance, especially when you scale the cluster horizontally.

I would start by cleaning up the config files, as I see some conflicting parameters in them, such as druid.broker.cache.useResultLevelCache being set to both true and false.

Your connection pooling parameters deviate a bit from best practices. I would reduce druid.broker.http.numConnections on the broker to a more reasonable value, let's say 50, assuming you have 1 broker, and then calculate druid.server.http.numThreads on the Historicals while considering how many brokers are in your cluster.

You can use this link for guidance:

If your cluster has 1 broker, then I would say go with druid.server.http.numThreads=60 in our example.

Also, please test with reducing the number of merge buffers to 20.
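Putting those suggestions together, a minimal sketch of the relevant properties, assuming a single broker (these are starting points to test, not definitive values):

# Broker
druid.broker.http.numConnections=50
druid.processing.numMergeBuffers=20

# Historical
# roughly numConnections x number of brokers, plus some headroom: 50 x 1 + 10
druid.server.http.numThreads=60
druid.processing.numMergeBuffers=20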

Please test with these changes and let me know how it goes; we can then look at more options to troubleshoot.