I’m looking for recommendations on high availability for the Overlord in a Tranquility + indexing service setup.
I understand that multiple Overlord nodes can run at the same time, and that one of them is elected leader if the previous leader dies. However, the new leader marks all running tasks as failed. Every new message pushed by Tranquility (service.apply(messages)) then gets stuck with a NoHostAvailable exception, even though all the middle managers are available to take new tasks.
To get indexing going again after a new Overlord takes leadership or the old one is restarted, I have to remove the /tranquility directory in ZooKeeper so that new tasks are created and start running. This loses some data, but we think that is better than having all realtime ingestion stuck, since we can always backfill with batch ingestion.
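For reference, here is a minimal sketch of that workaround in Python. It assumes the kazoo client library and a ZooKeeper reachable at localhost:2181. Both are assumptions for illustration, and the function names are hypothetical. The /tranquility path is the one mentioned above and may differ if Tranquility is configured with another ZooKeeper path.

```python
def remove_tranquility_state(zk, path="/tranquility"):
    """Recursively delete the Tranquility znode subtree if it exists.

    `zk` is any started client exposing kazoo-style exists() and delete().
    Returns True if the subtree was deleted, False if the path was absent.
    """
    if zk.exists(path) is None:
        return False
    # recursive=True also removes every child znode under `path`.
    zk.delete(path, recursive=True)
    return True


def main(hosts="localhost:2181"):
    # Assumption: kazoo is installed. Imported here so the helper above
    # can be exercised without it.
    from kazoo.client import KazooClient

    zk = KazooClient(hosts=hosts)
    zk.start()
    try:
        removed = remove_tranquility_state(zk)
        print("removed /tranquility" if removed else "/tranquility not present")
    finally:
        zk.stop()
        zk.close()
```

Calling main() with the ensemble's connection string performs the cleanup once. As noted above, this discards whatever state Tranquility kept there, so some data is lost.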
My question is: is there any reason why the Overlord cannot simply restore its state, so that a new Overlord leader can continue the running tasks? Currently it seems that a single Overlord node failure causes ingestion to stop.