If the overlord API stops responding for the 10-15 minutes while handoff is happening, you may need to increase the number of HTTP threads (druid.server.http.numThreads).
Other possible approaches:
- Use a bigger instance type for your metadata RDS,
- Enable slow query logging to understand what queries are taking a long time,
- Add one or more indexes on the segments table. Internally, the following index brought the run time of a query executed after acquiring the “giant lock” down from 1.6 seconds to 300 milliseconds:
CREATE INDEX idx_druid_segments_dsend on druid_segments(dataSource, end);
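
To check whether the new index is actually picked up, something like the following may help. This is only a sketch: it assumes a MySQL-compatible metadata store, the datasource name and timestamp are placeholders, and the SELECT is an illustrative query shape, not necessarily the exact query the overlord runs (slow query logging would show you the real one).

```sql
-- Assumes a MySQL-compatible metadata store.
-- List the indexes on the segments table; idx_druid_segments_dsend should appear.
SHOW INDEX FROM druid_segments;

-- Hypothetical query shape; 'my_datasource' and the timestamp are placeholders.
-- `end` is quoted with backticks because END is a SQL keyword.
EXPLAIN
SELECT id
FROM druid_segments
WHERE dataSource = 'my_datasource'
  AND `end` > '2024-01-01T00:00:00.000Z';
```

If the EXPLAIN output lists idx_druid_segments_dsend under "key", the optimizer is using the index for that query shape.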