Query performance issues on our production environment

We’re deploying Druid in our production environment. Query performance is acceptable most of the time, but it deteriorates significantly as the number of concurrent queries increases.

Our hardware and configuration are as follows:

3 broker nodes, each with 32 CPUs, 244 GB of memory, and 320 GB of storage.

Broker node configuration:

[configuration screenshot not rendered]

Broker memory configuration:

[configuration screenshot not rendered]

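For context, the broker properties we have been tuning are of this general form; the values below are illustrative placeholders for a 32-CPU / 244 GB node, not our actual production settings:

```properties
# Illustrative broker runtime.properties (placeholder values, not our actual config)
druid.service=druid/broker

# Processing threads and buffers used to merge results from historicals
druid.processing.numThreads=31
druid.processing.buffer.sizeBytes=536870912
druid.processing.numMergeBuffers=8

# HTTP server threads and connection pool to historicals
druid.server.http.numThreads=60
druid.broker.http.numConnections=20
```
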
13 historical nodes, each with 32 CPUs, 244 GB of memory, and 320 GB of storage (same as the brokers).

Historical node configuration:

[configuration screenshot not rendered]

Historical memory configuration:

[configuration screenshot not rendered]

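For context, the historical properties we have been tuning are of this general form; again, the values are illustrative placeholders, not our actual production settings:

```properties
# Illustrative historical runtime.properties (placeholder values, not our actual config)
druid.service=druid/historical

# Processing threads and buffers for scanning and merging segments
druid.processing.numThreads=31
druid.processing.buffer.sizeBytes=536870912
druid.processing.numMergeBuffers=8

# Local segment cache (each node has 320 GB of storage)
druid.segmentCache.locations=[{"path":"/mnt/druid/segment-cache","maxSize":300000000000}]
druid.server.maxSize=300000000000

druid.server.http.numThreads=60
```
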
We have two data sources, each with a single thetaSketch metric. The data sources are about 900 GB and 200 GB in size.

All of our queries are groupBy queries (since we’re using a thetaSketch metric) with a large number of unions and intersections. Our segment granularity is DAY.
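
To make the workload concrete, our queries are of roughly this shape: a groupBy with filtered thetaSketch aggregators combined through thetaSketchSetOp post-aggregators. The datasource, dimension, interval, and field names below are made up for illustration:

```json
{
  "queryType": "groupBy",
  "dataSource": "events",
  "granularity": "all",
  "intervals": ["2019-01-01/2019-01-08"],
  "dimensions": ["country"],
  "aggregations": [
    { "type": "filtered",
      "filter": { "type": "selector", "dimension": "action", "value": "view" },
      "aggregator": { "type": "thetaSketch", "name": "viewers", "fieldName": "user_sketch" } },
    { "type": "filtered",
      "filter": { "type": "selector", "dimension": "action", "value": "buy" },
      "aggregator": { "type": "thetaSketch", "name": "buyers", "fieldName": "user_sketch" } }
  ],
  "postAggregations": [
    { "type": "thetaSketchEstimate", "name": "viewers_who_bought",
      "field": { "type": "thetaSketchSetOp", "name": "intersection", "func": "INTERSECT",
        "fields": [
          { "type": "fieldAccess", "fieldName": "viewers" },
          { "type": "fieldAccess", "fieldName": "buyers" }
        ] } }
  ]
}
```
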

When running load tests with JMeter, we see broker and historical query times increase to more than two seconds. We also see a significant number of pending segments (usually about 500–1,000 per historical).

Average GC time is 4 s on brokers and 1.5 s on historicals, and the GC count is consistently around 23 on brokers and 30–40 on historicals. Average broker CPU usage is about 800%. I can provide additional metrics if that would help.

We’re also not using caching: we tried it, and it caused terrible performance. Perhaps we haven’t tuned it well, but we believe we can achieve better performance even before we begin tuning the cache.
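
If we do revisit caching, a minimal broker-side setup would look something like the sketch below, using the standard Druid cache properties; the size value is a placeholder:

```properties
# Enable query result caching on the broker (illustrative values)
druid.broker.cache.useCache=true
druid.broker.cache.populateCache=true
druid.cache.type=caffeine
druid.cache.sizeInBytes=10000000000
```
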

We believe that, tuned correctly, this hardware can deliver much better performance. We’re also flexible on hardware: since we run in the cloud, we can easily change the number and type of instances.

We really like Druid, and would like to achieve its full potential. Please advise.

Thank you,