Relates to Apache Druid 0.20.1
- Master Node: m5.4xlarge (x1)
- Data Nodes: r5.8xlarge (x5), 160 vCPUs total
- Query Node: c5.16xlarge (x1), 64 vCPUs
- Total Segments: 8,000
- Total Disk Size: 280 GB
- Replication: 1x
Hello Druid Forum,
I am trying to increase the query concurrency of my current cluster. I am mostly using timeseries queries, which are generally very quick (avg: 1.5 seconds). I have done some profiling on the current cluster and generated flame graphs, but I was not able to pinpoint the actual bottleneck. When I load test with 100 concurrent queries with useCache=false, I see data node CPU touching 85%-90%.
Here are the load test results
Can anyone please help me understand the issue here? To scale the number of concurrent queries, should I add more data nodes? Currently each data node has 31 processing threads and 66 HTTP threads. I am trying to get to 100 reqs/sec.
Here is the flame graph HTML link. Do you see any abnormalities here?
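For reference, the queries in the load test look roughly like this (a hypothetical sketch; the datasource `events` and the aggregation are placeholders, only the useCache=false context setting is the real one from the test):

```json
{
  "queryType": "timeseries",
  "dataSource": "events",
  "granularity": "hour",
  "intervals": ["2021-01-01/2021-01-02"],
  "aggregations": [
    { "type": "count", "name": "rows" }
  ],
  "context": { "useCache": false, "populateCache": false }
}
```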
Any help here would be much appreciated.
Thanks in advance.
Just a few thoughts here: 100 concurrent queries is a lot. The largest cluster I have worked with had fewer than 20 concurrent queries running in production…
I would look at the OS for other bottlenecks as well. Also, keep in mind that one thread can process a single segment, so you may not be able to use all processors, depending on your replication factor.
Thanks for your insights @Rachel_Pedreschi, I am looking into OS-level bottlenecks. Meanwhile, do you see anything unusual in the above flame graphs?
I have two suggestions from the flame graph:
- You are using unique counts. It is best to use the HLL sketch from the DataSketches extension; it is the most optimized.
- I can see about 15% of CPU spent on decompression. You could try keeping the indexes uncompressed (you can set this via the indexSpec in the ingestion spec).
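For the first point, an HLL sketch unique count at query time could look roughly like this (a sketch, not a drop-in query: the datasource `events` and column `user_id` are placeholders, and it assumes the `druid-datasketches` extension is in your load list):

```json
{
  "queryType": "timeseries",
  "dataSource": "events",
  "granularity": "all",
  "intervals": ["2021-01-01/2021-01-02"],
  "aggregations": [
    { "type": "HLLSketchBuild", "name": "unique_users", "fieldName": "user_id" }
  ]
}
```

Even better is to build the sketch column at ingestion time and use HLLSketchMerge at query time, so the per-query work is just merging pre-built sketches.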
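For the second point, the compression settings live in the indexSpec of the ingestion spec's tuningConfig; something along these lines (a sketch under assumptions: the `index_parallel` task type is illustrative, and uncompressed columns trade CPU for larger segments on disk and in the page cache):

```json
"tuningConfig": {
  "type": "index_parallel",
  "indexSpec": {
    "bitmap": { "type": "roaring" },
    "dimensionCompression": "uncompressed",
    "metricCompression": "uncompressed",
    "longEncoding": "longs"
  }
}
```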
Since you have roughly 150 queries in flight (100 qps x 1.5 s per query), I would try increasing the HTTP threads to about 150.
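On the data nodes that would be one line in runtime.properties (assuming druid.server.http.numThreads is the knob meant here):

```properties
# historical runtime.properties
druid.server.http.numThreads=150
```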
Give this a read: Basic cluster tuning · Apache Druid
On Broker RunTime Properties –
- Set druid.broker.http.numConnections to Number of Available Cores, as a starting point
- Set druid.processing.buffer.sizeBytes to 500 MB to 1 GB (the value is specified in bytes). If you set it to a higher value, please make sure you have enough available memory on the instances.
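As a concrete starting point, that could look like this in the Broker's runtime.properties (a sketch assuming the 64-core query node above; tune to your hardware):

```properties
# broker runtime.properties (hypothetical starting values)
druid.broker.http.numConnections=64
# 536870912 bytes = 512 MB
druid.processing.buffer.sizeBytes=536870912
```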
On Historical and Task RunTime Properties –
- Set druid.server.http.numThreads to Number of Query Nodes x druid.broker.http.numConnections
- Set druid.processing.buffer.sizeBytes to 500 MB to 1 GB (in bytes). Try to keep this in sync with the value on the Broker service.
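Putting the two together for a Historical's runtime.properties (a sketch for this cluster's shape, one query node with 64 broker connections; these are starting values, not tested settings):

```properties
# historical runtime.properties (hypothetical starting values)
# 1 query node x druid.broker.http.numConnections (64)
druid.server.http.numThreads=64
# keep in sync with the broker; 536870912 bytes = 512 MB
druid.processing.buffer.sizeBytes=536870912
```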
If you start seeing TOO MANY OPEN FILES errors, please increase the ulimit before starting the services.
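A quick way to check and raise the open-file limit in the shell that launches the service (session-scoped; making it permanent via /etc/security/limits.conf or a systemd unit's LimitNOFILE is an assumption about how you run Druid):

```shell
# Show the current soft and hard open-file limits
ulimit -S -n
ulimit -H -n

# Raise the soft limit up to the hard limit for this session,
# then start the Druid service from the same shell
ulimit -S -n "$(ulimit -H -n)"
ulimit -S -n
```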