[druid-user] Re: Tasks failing due to failed handoff when coordinator is busy

What’s your current maxSegmentsToMove setting?

Maybe you could reduce it so that only a limited number of move instructions are sent to the Historicals at one time?

AHA, it just occurs to me: I wonder if this could be down to not having enough Jetty threads on the Historicals? See "Basic cluster tuning" in the Apache Druid docs.
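To give a feel for how many Jetty threads a Historical gets if nothing is set, here is a rough sketch of the default for `druid.server.http.numThreads`. It assumes the formula in the Druid configuration reference, max(10, (cores * 17) / 16 + 2) + 30, applies to your version; the function name is made up for illustration.

```python
def default_http_num_threads(cores: int) -> int:
    """Assumed default for druid.server.http.numThreads.

    Based on the documented formula max(10, (cores * 17) / 16 + 2) + 30,
    using integer division. Check the docs for your Druid version.
    """
    return max(10, (cores * 17) // 16 + 2) + 30


if __name__ == "__main__":
    # Print the assumed default for a few common core counts.
    for cores in (4, 8, 16, 32):
        print(cores, "cores:", default_http_num_threads(cores), "threads")
```

If the Historicals run with a lower explicit value than this, that would be one thing to compare against.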

Any joy working this out, out of interest?

Hmmm, I’m not 100% sure. I do wonder if there are just not enough available Jetty threads, but I’m also not sure whether the Load / Drop for supervisors has any priority assigned to it so that they get loaded first.

Can I play back the issue - let me know if I have it wrong!!

  1. A Historical goes away on holiday
  2. The Historical comes back from holiday
  3. The coordinator starts to issue Load tasks to the Historical
  4. Supervised tasks go into a state waiting for hand-off
  5. The supervisor tasks sit and wait
  6. They wait more
  7. Nothing happens
  8. EITHER the task fails OR the load queue finally dissipates and the tasks complete OK

I could perhaps check in with some developers at Imply as well to see if they have any thoughts.

In my mind I remembered maxSegmentsInNodeLoadingQueue also - I wonder if you could maybe reduce that so that there aren’t too many segments all queued up to be loaded … maybe that will allow the tasks to implant themselves? (All theory!!)
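In case it helps with trying this out, here is a minimal sketch of how the two Coordinator dynamic settings mentioned in this thread (maxSegmentsToMove and maxSegmentsInNodeLoadingQueue) could be lowered through the Coordinator dynamic config endpoint, `/druid/coordinator/v1/config`. The host, port, helper names and example values are assumptions for illustration, not recommendations.

```python
import json
import urllib.request

# Assumption: replace with your own Coordinator host and port.
COORDINATOR = "http://localhost:8081"
CONFIG_PATH = "/druid/coordinator/v1/config"


def lowered_config(current: dict, max_to_move: int, max_in_queue: int) -> dict:
    """Return a copy of the dynamic config with the two limits changed."""
    updated = dict(current)  # start from the current config, keep other keys
    updated["maxSegmentsToMove"] = max_to_move
    updated["maxSegmentsInNodeLoadingQueue"] = max_in_queue
    return updated


def build_update_request(config: dict) -> urllib.request.Request:
    """Build (but do not send) the POST that would submit the config."""
    return urllib.request.Request(
        COORDINATOR + CONFIG_PATH,
        data=json.dumps(config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To use it, fetch the current config with a GET on the same path, pass it through `lowered_config`, and send the result with `urllib.request.urlopen(build_update_request(new_config))`. Starting from the fetched config means unrelated settings are carried over unchanged.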

Hi Diego, I have experienced similar behavior before, but with compaction jobs. We had simply run out of worker capacity on the Middle Managers, so we increased ‘druid.worker.capacity’. The Coordinator console has some good stats for monitoring ‘currCapacityUsed’ on remote workers. Hope that helps.
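For anyone who would rather check this from a script than from the console, here is a small sketch that flags workers with no free task slots. It assumes the JSON shape returned by the Overlord workers endpoint (`/druid/indexer/v1/workers`), where each entry has a `worker` object with `host` and `capacity`, plus a top-level `currCapacityUsed`. The sample payload and function names are invented for illustration.

```python
def capacity_summary(workers: list) -> list:
    """Return (host, used, capacity) tuples, one per worker entry."""
    rows = []
    for entry in workers:
        worker = entry["worker"]
        rows.append((worker["host"], entry["currCapacityUsed"], worker["capacity"]))
    return rows


def saturated_workers(workers: list) -> list:
    """Return hosts whose used capacity has reached druid.worker.capacity."""
    return [host for host, used, cap in capacity_summary(workers) if used >= cap]


if __name__ == "__main__":
    # Hypothetical payload in the assumed response shape.
    sample = [
        {"worker": {"host": "mm1:8091", "capacity": 4}, "currCapacityUsed": 4},
        {"worker": {"host": "mm2:8091", "capacity": 4}, "currCapacityUsed": 1},
    ]
    print(saturated_workers(sample))
```

If every Middle Manager shows up as saturated while tasks are waiting, that points towards worker capacity rather than the load queue.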

Sooooo sorry to be late replying. I’m asking some people here at Imply who know the code — I’ll let you know what I find out :smiley: