Router scaling issues

I’m currently looking into the scalability of a Druid cluster. Our utilisation is fairly bursty: a sudden series of fairly heavy queries arrives at around the same time, followed by little to no utilisation for the rest of the day. So we want to scale the nodes down as far as possible during the quiet periods.

I’ve devised a load test scenario using Gatling, which disables caching and slightly randomises the limit and date range of each query. The query also does some basic filtering and aggregation to simulate real-world scenarios.
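
For anyone wanting to reproduce the shape of the test, here is a minimal sketch of that kind of payload randomisation. It is written in Python rather than Gatling's Scala DSL purely for illustration, and the datasource, dimension, and metric names (`events`, `country`, `status`, `count`) as well as the date window and limit range are hypothetical placeholders, not the real schema. The `useCache` and `populateCache` context flags are what turn caching off per query.

```python
import json
import random
from datetime import date, timedelta


def build_query(rng: random.Random) -> dict:
    """Build one randomised Druid native groupBy payload (hypothetical schema)."""
    # Pick a random start day inside a hypothetical 30-day window,
    # then a random span of 1 to 7 days.
    window_start = date(2023, 1, 1)
    start = window_start + timedelta(days=rng.randint(0, 29))
    end = start + timedelta(days=rng.randint(1, 7))

    return {
        "queryType": "groupBy",
        "dataSource": "events",  # hypothetical datasource name
        "intervals": [f"{start.isoformat()}/{end.isoformat()}"],
        "granularity": "all",
        "dimensions": ["country"],  # hypothetical dimension
        # Basic filter and aggregation, standing in for the real ones.
        "filter": {"type": "selector", "dimension": "status", "value": "ok"},
        "aggregations": [
            {"type": "longSum", "name": "total", "fieldName": "count"}
        ],
        # Randomised limit so consecutive queries are not identical.
        "limitSpec": {
            "type": "default",
            "limit": rng.randint(10, 100),
            "columns": [{"dimension": "total", "direction": "descending"}],
        },
        # Disable both reading from and writing to the query cache.
        "context": {"useCache": False, "populateCache": False},
    }


if __name__ == "__main__":
    print(json.dumps(build_query(random.Random()), indent=2))
```

Each virtual user would POST one such payload to the router's native query endpoint.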

Currently, I have my brokers, routers, and coordinators on an r5.xlarge EC2 instance, and my historicals on an i3.xlarge instance with a 15GB SSD. This is all in Kubernetes, and I’ve set the node group up to scale up and down based on utilisation. I’m also using the Druid operator, and I’ve enabled horizontal pod autoscaling for essentially all of the components.

I’ve tried a lot of different settings, but the issue seems to be that the router starts refusing connections even though its CPU and memory utilisation is very low (~10% CPU, ~20% memory). By the time the CPU is high enough to trigger a scale-up, thousands of requests have already failed and the test ends.

So the problem is twofold. First, the routers are sluggish to spin up when the auto-scaling kicks in. Second, the CPU utilisation is so low that I have to set the scaling threshold to something like 5% to get it to kick in in time, which seems counterproductive.
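
To show why the threshold has to go so low, here is a small sketch of the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`, skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The function name, the `min_replicas` parameter, and the 50% target are illustrative assumptions; the ~10% figure is the router CPU seen during the test.

```python
import math


def desired_replicas(current_replicas: int,
                     current_util: float,
                     target_util: float,
                     min_replicas: int = 1,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA replica formula for a single utilisation metric."""
    ratio = current_util / target_util
    # The HPA leaves the replica count alone when the ratio is close to 1.0.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    # Otherwise scale proportionally, never dropping below the minimum.
    return max(min_replicas, math.ceil(current_replicas * ratio))


# Router at ~10% CPU with a hypothetical 50% target: no scale-up.
print(desired_replicas(1, current_util=10, target_util=50))
# Same load with the target dropped to 5%: scale-up is requested.
print(desired_replicas(1, current_util=10, target_util=5))
```

With CPU as the only metric, a router that is refusing connections at 10% CPU never looks busy to the autoscaler unless the target is set below that figure.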

I’ve tried setting the number of threads/max threads for the routers to something really high (up to around 500) and to something really low, i.e. one per core per instance (~4).
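
As a rough sanity check on those two extremes, here is a back-of-envelope sketch using Little's law (requests in flight = arrival rate × mean latency). The arrival rate and latency below are made-up numbers, not measurements from the test, and the sketch does not establish which limit is actually being hit (router threads, router-to-broker connections, or something further downstream); it only shows how a concurrency cap can bound throughput without using much CPU.

```python
def required_concurrency(arrival_rate_per_s: float, mean_latency_s: float) -> float:
    """Little's law: average number of requests in flight at steady state."""
    return arrival_rate_per_s * mean_latency_s


def max_throughput(concurrency_limit: int, mean_latency_s: float) -> float:
    """Upper bound on requests per second for a fixed concurrency limit."""
    return concurrency_limit / mean_latency_s


# Hypothetical burst: 200 requests/s, each taking 2 s end to end.
in_flight = required_concurrency(arrival_rate_per_s=200, mean_latency_s=2.0)
print(f"requests in flight during the burst: {in_flight}")

# The two thread settings tried above, treated as concurrency limits.
for limit in (4, 500):
    rate = max_throughput(limit, mean_latency_s=2.0)
    print(f"limit={limit}: at most {rate} requests/s")
```

If requests spend most of their time waiting on a downstream service, the cap can be reached while the CPU stays nearly idle, which would be consistent with refused connections at ~10% CPU.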

I’m using the tiny cluster example from the Druid operator repo (which I’ve tweaked). The routers in Kubernetes have 4 CPU units for both desired and max, with 4G memory desired and 8G memory max.

Does anyone have any suggestions on how I can get the routers to make better use of the resources available to them? Or any general tips/articles/links on scaling on Kubernetes?

Any help is greatly appreciated!