Overlord Scaling Issue

I have the below Druid stack:
1 Overlord - m5.12xlarge
50 MiddleManagers - i3en.24xlarge

I’m submitting S3 ingestion tasks over Parquet files using index_parallel with maxSubTasks set to 1000, but after 2-3 index_parallel tasks and around 2k+ sub-tasks, the Overlord becomes extremely slow at creating new tasks.
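
For reference, a rough sketch of the relevant tuningConfig fragment (other fields omitted; assuming maxSubTasks here refers to the maxNumConcurrentSubTasks setting documented for index_parallel):

"tuningConfig": {
  "type": "index_parallel",
  "maxNumConcurrentSubTasks": 1000
}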

Overlord Config is as below:
druid.service=druid/overlord
druid.plaintextPort=8081

druid.coordinator.startDelay=PT10S
druid.coordinator.period=PT5S

# Run the overlord service in the coordinator process

#druid.coordinator.asOverlord.enabled=false
#druid.coordinator.asOverlord.overlordService=druid/overlord

druid.indexer.queue.startDelay=PT5S

druid.indexer.runner.type=remote
druid.indexer.storage.type=metadata
druid.indexer.storage.recentlyFinishedThreshold=PT2H
druid.indexer.runner.pendingTasksRunnerNumThreads=50
druid.manager.config.pollDuration=PT10M
druid.indexer.storage.recentlyFinishedThreshold=PT15M
druid.coordinator.curator.create.zknode.numThreads=100
druid.coordinator.balancer.strategy=random
druid.indexer.runner.maxZnodeBytes=15728640

JVM config:
-server
-Xms120g
-Xmx120g
-XX:+ExitOnOutOfMemoryError
-XX:+UseG1GC
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-Djava.io.tmpdir=var/tmp
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
-Dderby.stream.error.file=var/druid/derby.log
-Djute.maxbuffer=15728640

Druid version is 0.19.0

What can I do to scale the Overlord to run 10K+ tasks in parallel? Any help or direction would be really helpful. Thanks in advance!

Maybe druid.indexer.queue.maxSize? Here’s the description in the docs:

Maximum number of active tasks at one time.
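
For example (just a sketch; the value below is illustrative, not a recommendation), you would set it in the Overlord runtime.properties:

# Sketch only: value is illustrative
druid.indexer.queue.maxSize=10000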

You need to have enough worker capacity on your MiddleManagers to handle all of these subtasks.

Also, with the current release of Druid, too many tasks cause memory pressure in the Overlord, especially when you refresh the UI screen with the task list. This is being fixed, but I am not sure when it will reach OSS Druid. This may be affecting you as well.

The default value is Integer.MAX_VALUE, so it’s already a huge number.

My MiddleManagers have the capacity to run 7k+ tasks, but with 2-3k things become unhealthy. Memory pressure could be the cause. Did this get fixed in the 0.23 release?

Your maximum worker capacity should be set to (CPUs per node - 1) * number of MiddleManager nodes as a general rule of thumb. It sounds like you have it set much higher, and this will likely be a source of contention. Also, have you set the MM peon JVM properties (druid.indexer.runner.javaOptsArray)?

Each MM spawns one JVM per worker using these JVM options. It is expected that each JVM will use up a whole CPU; if you have too many, things get contentious.
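
As an illustration only (assuming 96 vCPUs per i3en.24xlarge; the heap and direct memory sizes below are placeholders, not recommendations), the per-node settings in the MiddleManager runtime.properties would look roughly like:

# Sketch: capacity = vCPUs - 1; peon JVM sizes are illustrative
druid.worker.capacity=95
druid.indexer.runner.javaOptsArray=["-server","-Xms1g","-Xmx1g","-XX:MaxDirectMemorySize=2g","-Duser.timezone=UTC","-Dfile.encoding=UTF-8","-XX:+ExitOnOutOfMemoryError"]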

Hope this helps,
Sergio

We don’t run the same number of tasks as the original author, but we are seeing similar issues with the Overlord (16 CPU, 128G RAM, 15G heap) when ramping up the number of tasks. We actually only use 2-4 tasks regularly per MM (16 CPU, 128G RAM, 3G heap per Peon via druid.indexer.runner.javaOptsArray) and have a max worker capacity of 10 per MM (8 MMs total).

What we’ve seen is strange: the Overlord starts to behave like a snail and seems to lose track of tasks and struggle with task management (rollout and completion). The UI is of course pretty unusable during this time, and the resource usage (CPU/MEM/JVM heap) of the Overlord does not seem to increase, even though the number of TCP connections it is handling goes up to 900 at times.

We start to see a variety of errors like:

  1. Some tasks failing with: "errorMsg": "Task [index_kafka_REDACT_941fd57f52aebbb_gbbmjhmp] returned empty offsets after pause"
  2. A few with: "errorMsg": "The worker that this task is assigned did not start it in timeout[PT10M]. See overlord logs for more..."
  3. A few with:
  • "errorMsg": "Task [index_kafka_REDACT_b4b8fdbe7d46f26_mbljmdld] failed to return status, killing task"
  • "errorMsg": "Task [index_kafka_REDACT_ff20e3161a9445e_bkjimalf] failed to stop in a timely manner, killing task"
  • "errorMsg": "Task [index_kafka_REDACT_b5157008402d2aa_ogjhbpod] failed to return start time, killing task"
  • 1-2 with the usual error: "errorMsg": "An exception occured while waiting for task [index_kafka_REDACT_091c74b39f9c912_hckphlkm] to pause: [..."
  4. A few long-running tasks that did not seem to be tracked/managed properly by the Overlord (super long running), as if it couldn’t keep up with everything going on and lost track of these tasks but eventually got to them (we have taskDuration: 1 hour and completionTimeout: 30 or 45 min depending on the supervisor):
  • SUCCESS - Duration: 2:16:05
  • SUCCESS - Duration: 2:17:05
  • SUCCESS - Duration: 2:16:06

Our configuration for Overlord looks like:

## runtime.properties ##
druid.service=druid/overlord
druid.plaintextPort=8090

# runner
druid.indexer.queue.startDelay=PT1M
druid.indexer.runner.type=remote
druid.indexer.storage.type=metadata
druid.indexer.runner.taskAssignmentTimeout=PT10M
druid.indexer.runner.taskShutdownLinkTimeout=PT3M

## jvm.config ##
-server
-Xms15g
-Xmx15g
-XX:MaxDirectMemorySize=10500m
-XX:+ExitOnOutOfMemoryError
-XX:+UseG1GC
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-Djava.io.tmpdir=var/tmp
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
-Dservice=overlord

More details, including screenshots of resource usage showing the TCP connections spiking while resource usage stays flat (when you would expect it to go up as well), are in the GitHub comment linked below:

@TechLifeWithMohsin - do you guys also see the Overlord/tasks throwing a lot of different errors and slow/degraded task management (not just task starting/rollout but also completion handling)? What about resource consumption? Does it increase as you would expect when you increase the number of tasks, or does it stay fairly stagnant with small jumps?

Tasks are getting picked up by the MMs the moment they are created by the Overlord; the problem is that the Overlord is not creating sub-tasks as expected.

No, this issue was not observed.

Ah okay, thanks for the response. I was thinking this was similar to an issue we are seeing and that I could possibly provide more information. Sorry for butting in! Hope this gets resolved.

OK wow so I can’t pretend to have seen a cluster with that many tasks running (!)
One thought that came to mind was HTTP, as this can bottleneck and cause queues.
There are some config items here:

And there’s a little bit here, too:
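
For instance (a sketch only; the right values depend on your cluster and load), the server-side and client-side HTTP settings live in the Overlord’s runtime.properties:

# Illustrative values, not recommendations
druid.server.http.numThreads=100
druid.global.http.numConnections=50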

Just checking for clarity here, too: is it that the tasks are created but not running (i.e., they come up but don’t start), or is it that they just don’t get created in the first place?

The Overlord creates sub-tasks for a given index_parallel task, but it gets slow after creating ~800 tasks, while my job is supposed to create 1,500. I’ve also seen it sometimes write the status as running while no MM is attached to the task.

And the Overlord logs are filled with lines like the below:
2022-07-24T03:50:30,228 INFO [qtp168658781-1098] org.apache.druid.indexing.overlord.TaskLockbox - Task[index_parallel_w_jncihemc_2022-07-24T03:48:26.449Z] already present in TaskLock[index_parallel_w_jncihemc_2022-07-24T03:48:26.449Z]
2022-07-24T03:50:31,223 INFO [qtp168658781-1518] org.apache.druid.indexing.overlord.TaskLockbox - Task[index_parallel_w_jncihemc_2022-07-24T03:48:26.449Z] already present in TaskLock[index_parallel_w_jncihemc_2022-07-24T03:48:26.449Z]
2022-07-24T03:50:32,220 INFO [qtp168658781-1582] org.apache.druid.indexing.overlord.TaskLockbox - Task[index_parallel_w_jncihemc_2022-07-24T03:48:26.449Z] already present in TaskLock[index_parallel_w_jncihemc_2022-07-24T03:48:26.449Z]
2022-07-24T03:50:33,217 INFO [qtp168658781-1455] org.apache.druid.indexing.overlord.TaskLockbox - Task[index_parallel_w_jncihemc_2022-07-24T03:48:26.449Z] already present in TaskLock[index_parallel_w_jncihemc_2022-07-24T03:48:26.449Z]
2022-07-24T03:50:34,254 INFO [qtp168658781-1560] org.apache.druid.indexing.overlord.TaskLockbox - Task[index_parallel_w_jncihemc_2022-07-24T03:48:26.449Z] already present in TaskLock[index_parallel_w_jncihemc_2022-07-24T03:48:26.449Z]
2022-07-24T03:50:35,265 INFO [qtp168658781-1466] org.apache.druid.indexing.overlord.TaskLockbox - Task[index_parallel_w_jncihemc_2022-07-24T03:48:26.449Z] already present in TaskLock[index_parallel_w_jncihemc_2022-07-24T03:48:26.449Z]
2022-07-24T03:50:36,276 INFO [qtp168658781-1091] org.apache.druid.indexing.overlord.TaskLockbox - Task[index_parallel_w_jncihemc_2022-07-24T03:48:26.449Z] already present in TaskLock[index_parallel_w_jncihemc_2022-07-24T03:48:26.449Z]
2022-07-24T03:50:37,287 INFO [qtp168658781-1539] org.apache.druid.indexing.overlord.TaskLockbox - Task[index_parallel_w_jncihemc_2022-07-24T03:48:26.449Z] already present in TaskLock[index_parallel_w_jncihemc_2022-07-24T03:48:26.449Z]
2022-07-24T03:50:38,283 INFO [qtp168658781-1623] org.apache.druid.indexing.overlord.TaskLockbox - Task[index_parallel_w_jncihemc_2022-07-24T03:48:26.449Z] already present in TaskLock[index_parallel_w_jncihemc_2022-07-24T03:48:26.449Z]
2022-07-24T03:50:39,293 INFO [qtp168658781-1687] org.apache.druid.indexing.overlord.TaskLockbox - Task[index_parallel_w_jncihemc_2022-07-24T03:48:26.449Z] already present in TaskLock[index_parallel_w_jncihemc_2022-07-24T03:48:26.449Z]

That log would indicate that you are running into some serious lock issues, too.

Might I ask why so many tasks? I’ve only ever seen far fewer than that at a time, tbh.

Wouldn’t a task lock already being present when an Overlord is trying to start a task indicate one or both of the following?

  1. Split-brain issue, with both Overlords becoming leader. I’ve seen this before when ZooKeeper nodes are flapping.
  2. Inconsistent state in the metadata DB: is the Overlord leader getting different data when querying? For example, are you using something like PostgreSQL HA with Pgpool, where multiple primaries end up with data drift/inconsistencies?

To verify the split brain with the Overlord, you can curl the Overlords and see if more than one claims to be the leader.

example:

$ curl -L -i -X GET http://REDACT.com:8090/druid/indexer/v1/isLeader
HTTP/1.1 200 OK
Date: Wed, 03 Aug 2022 14:48:53 GMT
Cache-Control: no-cache, no-store, max-age=0
Content-Security-Policy: frame-ancestors 'none'
Content-Type: application/json
Vary: Accept-Encoding, User-Agent
Transfer-Encoding: chunked
 
{"leader":true}

If 2 or more return true, you have a split brain with the Overlord. curl is a good way to identify this problem, since the UI won’t be responsive during this time.
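
If you want to check all of your Overlords in one go, a quick loop over the same endpoint works (hostnames below are placeholders):

for host in overlord-1.REDACT.com overlord-2.REDACT.com; do
  # isLeader returns {"leader":true} only on the current leader
  echo -n "$host: "
  curl -s -L "http://$host:8090/druid/indexer/v1/isLeader"
  echo
done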

For the inconsistent DB issue, that may be harder or easier to identify depending on what platform you are using. Could you provide that information? I’ve seen some users use the Helm chart for PostgreSQL-HA, and that has a known issue with split brain/inconsistencies, as I’ve outlined here:

Btw, that workaround is not consistent, and I would NOT recommend using this PG-HA implementation in production.