MiddleManager workers not used

Hi guys,

I upgraded from 0.16.0 to 0.18.1 (new cluster), and one feature that is important to us, which worked well in 0.16.0 and prior, is now giving us a lot of issues. I will try to outline as much information as I can. I would really appreciate some help on this.

The basic problem is that the MiddleManager workers are not being used despite there being a lot of pending tasks. For example, at one point this is what the UI showed:

| Worker | Type | Category | Port | Slots used | Last completed task |
|--------|------|----------|------|------------|----------------------|
| xx.xx.xx.xx:8091 | middle_manager | _default_worker_category | 8091 (plain) | 8 / 31 | 2020-06-05T15:24:00.104Z |
| xx.xx.xx.yy:8091 | middle_manager | _default_worker_category | 8091 (plain) | 11 / 31 | 2020-06-05T15:25:32.203Z |
| xx.xx.xx.zz:8091 | middle_manager | _default_worker_category | 8091 (plain) | 2 / 31 | 2020-06-05T15:25:08.204Z |

Only 21 out of 93 available worker slots are being used. I have tried multiple ingestion specs with different tuning configs and I hit the same problem each time.

Case 1:

```json
"tuningConfig": {
  "type": "index_parallel",
  "partitionsSpec": {
    "type": "dynamic"
  },
  "maxNumSubTasks": 999
},
```

Case 2:

```json
"tuningConfig": {
  "type": "index_parallel",
  "partitionsSpec": {
    "type": "single_dim",
    "partitionDimension": "ctxid",
    "assumeGrouped": true,
    "maxRowsPerSegment": 5000000
  },
  "forceGuaranteedRollup": true,
  "maxNumSubTasks": 999
},
```

0.16.0 never had this issue and always utilized all the workers 100% of the time. I would really appreciate any help with this; let me know if you need further information from me.

Hi Karthik,

IMO, just setting maxNumSubTasks to a higher value does not mean that it will use the number of worker slots specified. It can use up to that many, but if the ingestion does not need that many sub-tasks, it won't spawn them.
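For example, the actual parallelism is bounded by that setting, by the number of free worker slots, and by how many input splits the source produces (roughly one sub-task per split for dynamic partitioning). Below is a minimal sketch of the fields I would check, assuming the 0.18 `index_parallel` tuningConfig; the values are made-up examples, and if I remember correctly `maxNumSubTasks` was deprecated in favor of `maxNumConcurrentSubTasks`, so please verify the exact field names against the docs for your version:

```json
"tuningConfig": {
  "type": "index_parallel",
  "maxNumConcurrentSubTasks": 90,
  "splitHintSpec": {
    "type": "maxSize",
    "maxSplitSize": 536870912
  }
}
```

So if your input only yields around 21 splits (for example, 21 large input files), you would see roughly 21 running sub-tasks no matter how high the cap is set.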

What happens when you run a new ingestion? Does it go into a pending state despite there being available workers?

Thanks and Regards,
Vaibhav

I would understand that if the running tasks were the last ones left. But if I have tasks in the pending and waiting states, they should run on the free workers, right? In the example above, I have more tasks pending but only 21 are running. This is stretching my ETL time from 1.5 hours to 6 hours; that's 4.5 hours of wasted EC2 resources. Not to mention, I don't know whether scaling the cluster up would run any more tasks than are already running.