Druid Overlord getting connection refused when fetching task status from MiddleManager

Out of dozens of tasks, a portion of them (around 10%) always fail with an error.
The failing tasks have something in common:
They start fine and their logs look healthy, but at some point they receive a shutdown request and shut themselves down gracefully, which (logically) leads to the task being flagged as FAILED.
The logs can also show `No task in the corresponding pending completion taskGroup[7] succeeded before completion timeout`.

The Overlord log, however, shows something suspicious. For the given task, when the Overlord tries to fetch its status via the endpoint `/druid/worker/v1/chat/{TASK_ID}/status`, the request fails. It retries several times (based on the number of retries specified in the config), and when none of them succeed it shuts the task down by sending it a shutdown request over HTTP.
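If I understand the docs correctly, that retry/timeout behaviour for streaming tasks is governed by the supervisor's tuningConfig. A minimal sketch with the documented defaults, assuming a Kafka supervisor (these are not necessarily the values we run with):

```json
{
  "type": "kafka",
  "tuningConfig": {
    "type": "kafka",
    "httpTimeout": "PT10S",
    "chatRetries": 8,
    "shutdownTimeout": "PT80S"
  }
}
```

Once the `chatRetries` are exhausted, the supervisor treats the task as unresponsive and asks it to stop, which seems to match what I see in the Overlord log.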

It also happens for almost all ingestions of data sources with a high number of tasks, so there must be a common cause.

Which MiddleManager configs could be affecting this?

How many total tasks are you running? Could this just be a capacity issue? How many worker slots do you have across middle managers?

No, that cannot be a task capacity issue, because:

  • the task has already started and is in the middle of its work when the Overlord, unable to get its status, sends it the shutdown signal;
  • the running task count versus the total MiddleManager capacity also shows plenty of free worker slots.

Could the tasks be starved for resources and therefore unable to respond to the Overlord's status requests?
What does CPU utilization look like on the MM when this occurs?
How many cores does the MM have? What is your worker capacity set to? If it is higher than (number of cores - 1), the resulting CPU contention could cause this.
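For reference, the setting I mean is `druid.worker.capacity` in the MiddleManager `runtime.properties`; a sketch with illustrative numbers (assuming a 16-core host, not a recommendation for your hardware):

```properties
# MiddleManager runtime.properties (illustrative)
# Keep at least one core free for the MM process itself and the OS.
druid.worker.capacity=15
```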

There are tens of cores in each MM, and the capacity is set to NUM_OF_CORES - 2 or 3.
There are also several MMs, and it occurs on all of them.
CPU usage looks fine in htop.
Could it be due to not having enough chat handlers?
They are set exactly according to the formula in the Druid docs (even a bit higher).

Could be. Are these streaming ingestions? Are there a lot of queries going on as well?

How about `druid.server.http.numThreads` on the tasks?
From the docs:

On the Historical/Task side, this means that `druid.server.http.numThreads` must be set to a value at least as high as the sum of `druid.broker.http.numConnections` across all the Brokers in the cluster.

The practical recommendation is `SUM(druid.broker.http.numConnections) + 10`; the extra 10 is headroom for things like status checks.
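As an example, on the MiddleManagers you can pass this down to the peons with the `druid.indexer.fork.property.` prefix. A sketch assuming 2 Brokers with `druid.broker.http.numConnections=20` each (illustrative numbers, adjust to your cluster):

```properties
# MiddleManager runtime.properties (illustrative)
# 2 Brokers x numConnections 20 = 40, plus 10 headroom for status checks etc.
druid.indexer.fork.property.druid.server.http.numThreads=50

# Historicals get the equivalent setting directly in their own runtime.properties:
# druid.server.http.numThreads=50
```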

Hope this helps.

I'll take that into account, nice tip!
Yes, there are many queries running against the data.
Let's see what that practical formula does.

Thanks.

Also, can you check the status of the subtasks? I always have some failed (auto)compaction jobs where they start but then fail to keep the locks they acquired.