We recently needed to change our Druid cluster's worker select strategy because we have 2 data sources that need different JVM settings. There are 4 middleManagers, and we want to set it up so that one KIS supervisor uses 1 middleManager and the other KIS supervisor uses the remaining three. However, after submitting the worker selectStrategy and restarting the overlord services and the KIS service, I still see the KIS supervisor not following what we configured for the worker select strategy. Can anybody tell me what I did wrong here?
Below is our worker selectStrategy JSON:
We are currently on Druid 0.10.1. In the druid_config table of Druid's MySQL database, I can see a record named ‘worker.config’.
What do you mean exactly by “not following”?
Fwiw, in Druid 0.11.0 there will be a new concept of “strong affinity” that can apply to affinity configs. With weak affinity (the default), tasks for a dataSource may be assigned to other middleManagers if their affinity-mapped middleManagers are not able to run all pending tasks in the queue for that dataSource. With strong affinity, tasks for a dataSource will only ever be assigned to their affinity-mapped middleManagers, and will wait in the pending queue if necessary.
Might it be that you are getting the weak affinity behavior (survey_events tasks would run on other middleManagers if ip-xx-xxx-xx-xx.aws.foreseeresults.com:8091 is full)? If so, you could switch to strong affinity in 0.11.0.
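For anyone who wants to try the strong affinity option after upgrading, here is a rough Python sketch that builds such a worker config. The field names (selectStrategy, affinityConfig, affinity, strong) and the "fillCapacity" type are my reading of the Druid 0.11.0 overlord dynamic configuration docs, so treat them as assumptions and verify them against the version you run. The helper build_worker_config is hypothetical and only exists for this illustration; the datasource and host are the ones mentioned in this thread.

```python
import json


def build_worker_config(affinity, strong=False):
    """Build a worker config dict with an affinity mapping.

    Assumption: key names follow the Druid 0.11.0 docs for the overlord
    dynamic worker config; check them against your Druid version.
    """
    return {
        "selectStrategy": {
            "type": "fillCapacity",
            "affinityConfig": {
                # dataSource name -> list of middleManager "host:port" strings
                "affinity": affinity,
                # False = weak affinity (default), True = strong affinity
                "strong": strong,
            },
        }
    }


config = build_worker_config(
    {"survey_events": ["ip-xx-xxx-xx-xx.aws.foreseeresults.com:8091"]},
    strong=True,
)
payload = json.dumps(config, indent=2)
print(payload)
```

The printed JSON is what you would submit to the overlord as the worker config (to my knowledge via a POST to /druid/indexer/v1/worker, but please confirm the endpoint in the docs for your version).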
After I implemented the worker select strategy and restarted the overlord service, I also restarted the survey_events supervisor. I can see 4 tasks running for this supervisor, but not all of them are running on the ip-xx-xxx-xx-xx.aws.foreseeresults.com node. Each of the middleManager nodes has 8 workers, and the ip-xx-xxx-xx-xx.aws.foreseeresults.com:8091 node has 8 workers free, so I would expect all 4 tasks to run on it. However, only 2 tasks run on this node; the other 2 always run on another node, which is not assigned to this datasource.
That’s why I wonder whether I configured something wrong. Why aren’t all of the tasks for this data source running on that node when it still has capacity?
What is the replication factor for the ingestion? Replicant tasks cannot run on the same physical node; otherwise, losing that one node could lose all replicants of the data. If the two tasks that are running on another node are replicants, that could also explain it.
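To make the replicant constraint concrete, here is a toy Python simulation. It is not Druid's actual scheduler, just a sketch under stated assumptions: 4 tasks made up of 2 replicant pairs (which is what 4 tasks with a replication factor of 2 implies), 8 slots per middleManager, and placeholder worker names mm1 to mm4 where mm1 stands in for the affinity-mapped node. The function assign and the task tuples are hypothetical names for illustration only.

```python
from collections import defaultdict


def assign(tasks, workers, capacity, affinity, strong=False):
    """Toy task assignment; NOT Druid's real implementation.

    tasks: list of (task_id, datasource, group); replicants share a group.
    Returns {task_id: worker name, or None if the task stays pending}.
    """
    load = defaultdict(int)    # worker -> number of tasks placed on it
    groups = defaultdict(set)  # worker -> replicant groups it already hosts
    placement = {}

    def can_run(worker, group):
        # A worker needs a free slot and must not already host a replicant
        # from the same group.
        return load[worker] < capacity and group not in groups[worker]

    for task_id, datasource, group in tasks:
        preferred = affinity.get(datasource, [])
        # Weak affinity may fall back to other workers; strong affinity may not.
        fallback = [] if strong else [w for w in workers if w not in preferred]
        chosen = next(
            (w for w in list(preferred) + fallback if can_run(w, group)), None
        )
        placement[task_id] = chosen
        if chosen is not None:
            load[chosen] += 1
            groups[chosen].add(group)
    return placement


workers = ["mm1", "mm2", "mm3", "mm4"]
affinity = {"survey_events": ["mm1"]}
tasks = [
    ("t0_a", "survey_events", 0), ("t0_b", "survey_events", 0),
    ("t1_a", "survey_events", 1), ("t1_b", "survey_events", 1),
]

weak = assign(tasks, workers, capacity=8, affinity=affinity)
strong = assign(tasks, workers, capacity=8, affinity=affinity, strong=True)
print(weak)
print(strong)
```

In this sketch, weak affinity puts one replicant of each pair on mm1 and sends the other to a different node, even though mm1 still has free slots. With strong affinity and only one mapped node, the second replicant of each pair would stay pending, so a strong affinity mapping would need at least as many nodes as the replication factor.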
Thank you. That’s the reason! The task replication factor is 2.