[Druid 0.8.3] Overlord stuck after MySQL (Multi-AZ RDS) failover

Hi,

We experienced an event where realtime ingestion through Tranquility got stuck after a Multi-AZ RDS MySQL instance recovered from a failover. No error or exception can be found in the Overlord log. The Overlord process was alive, but the [TaskQueue-StorageSync] thread seems to be stuck or to have died. In the example below, there are no more “Synced X tasks” log lines after 22:28:58,868 UTC. The MySQL instance had a partition/segment crash, which triggered the failover. The last time we had an RDS MySQL failover, the Overlord reported something like “com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure”, but this time no error was found at all.

The mitigation was to restart the Overlord, but I am wondering whether this is a known bug or something that has not been noticed before.

RDS Event:

time (utc-7)	event
Sep 4 3:32 PM	Multi-AZ instance failover completed
Sep 4 3:31 PM	DB instance restarted
Sep 4 3:31 PM	Multi-AZ instance failover started
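
The RDS event times are in UTC-7 while the Overlord log is in UTC, so the two timelines have to be aligned before comparing them. A minimal sketch of that conversion follows; the helper name rds_to_utc and the fixed year 2016 are assumptions made for illustration, not part of any RDS or Druid tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the RDS console times are in a fixed UTC-7 offset, as the
# "time (utc-7)" column header states. The Overlord log is in UTC.
RDS_TZ = timezone(timedelta(hours=-7))


def rds_to_utc(text, year=2016):
    """Convert an RDS event time such as 'Sep 4 3:31 PM' (UTC-7) to UTC.

    Hypothetical helper: the year is not in the event text, so it is
    passed in separately.
    """
    local = datetime.strptime(f"{year} {text}", "%Y %b %d %I:%M %p")
    return local.replace(tzinfo=RDS_TZ).astimezone(timezone.utc)


if __name__ == "__main__":
    events = [
        ("Multi-AZ instance failover started", "Sep 4 3:31 PM"),
        ("DB instance restarted", "Sep 4 3:31 PM"),
        ("Multi-AZ instance failover completed", "Sep 4 3:32 PM"),
    ]
    for label, when in events:
        print(rds_to_utc(when).strftime("%Y-%m-%d %H:%M UTC"), label)
```

With this conversion the failover window is 22:31 to 22:32 UTC, which can be compared directly against the Overlord log timestamps.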

Overlord log, grepped for TaskQueue-StorageSync:
2016-09-04 22:25:58,873 INFO i.d.i.o.TaskQueue [TaskQueue-StorageSync] Synced 52 tasks from storage (0 tasks added, 0 tasks removed).
2016-09-04 22:26:58,868 INFO i.d.i.o.TaskQueue [TaskQueue-StorageSync] Synced 50 tasks from storage (0 tasks added, 0 tasks removed).
2016-09-04 22:27:58,868 INFO i.d.i.o.TaskQueue [TaskQueue-StorageSync] Synced 50 tasks from storage (0 tasks added, 0 tasks removed).
2016-09-04 22:28:58,868 INFO i.d.i.o.TaskQueue [TaskQueue-StorageSync] Synced 50 tasks from storage (0 tasks added, 0 tasks removed).
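
Since the Overlord logged no error, the only visible symptom is that these sync lines stop. A rough sketch of how one might detect that from the log is below. It assumes the line format shown above; the function names and the 3-minute threshold are hypothetical choices (the log shows roughly one sync per minute), not Druid settings.

```python
import re
from datetime import datetime, timedelta

# Matches lines like:
# 2016-09-04 22:28:58,868 INFO i.d.i.o.TaskQueue [TaskQueue-StorageSync] Synced 50 tasks from storage (...)
SYNC_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} INFO .*"
    r"\[TaskQueue-StorageSync\] Synced \d+ tasks from storage"
)


def last_sync(lines):
    """Return the timestamp of the last 'Synced N tasks' line, or None."""
    latest = None
    for line in lines:
        match = SYNC_RE.match(line)
        if match:
            latest = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return latest


def is_stalled(lines, now, max_gap=timedelta(minutes=3)):
    """True if no sync line was seen, or the last one is older than max_gap.

    max_gap is an assumed threshold for illustration only.
    """
    latest = last_sync(lines)
    return latest is None or now - latest > max_gap


if __name__ == "__main__":
    sample = [
        "2016-09-04 22:27:58,868 INFO i.d.i.o.TaskQueue [TaskQueue-StorageSync] "
        "Synced 50 tasks from storage (0 tasks added, 0 tasks removed).",
        "2016-09-04 22:28:58,868 INFO i.d.i.o.TaskQueue [TaskQueue-StorageSync] "
        "Synced 50 tasks from storage (0 tasks added, 0 tasks removed).",
    ]
    print(last_sync(sample))
    print(is_stalled(sample, now=datetime(2016, 9, 4, 22, 40, 0)))
```

Run periodically against the Overlord log (with now taken in UTC to match the log), a check like this would flag the stall within a few minutes instead of waiting for Tranquility to start failing.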

Tranquility log exceptions:

com.metamx.tranquility.beam.ClusteredBeam: Emitting alert: [anomaly] Failed to create merged beam: druid:prod:overlord/data_source
{ }
java.lang.IllegalStateException: Failed to save new beam for identifier[druid:prod:overlord/data_source] timestamp[2016-09-04T23:00:00.000Z]

Hi Shuai, can you reproduce this issue in the latest stable?

I haven’t tried that yet; it will take some time to set up a local MySQL and crash a partition intentionally. I’ll update if I can reproduce it. I just wanted to check whether others have seen this before.