Prevent permanent worker deadlock when cutover times out waiting for binlog sentinel by VarunChandola · Pull Request #1637 · github/gh-ost
Fixes #971.
We recently hit a case where gh-ost stalled permanently mid-cutover: heartbeat lag grew
without bound and no further cutover attempts were made, even though the process was still running.
Background
As part of atomic cutover, gh-ost locks the original table, waits for all in-flight DML
to be applied to the ghost table, then executes the rename. The "wait for in-flight DML"
step works via a sentinel value written to the changelog table. This sentinel travels via
the binlog stream back to gh-ost, which applies binlog events sequentially through a
single-threaded applyEventsQueue. The sentinel is enqueued onto that same queue, so by
the time the applier executes it, all preceding DML is guaranteed to have been applied:
```go
// onChangelogStateEvent — spawns a goroutine to enqueue the sentinel
var applyEventFunc tableWriteFunc = func() error {
	this.allEventsUpToLockProcessed <- &lockProcessedStruct{...}
	return nil
}
go func() {
	this.applyEventsQueue <- newApplyEventStructByFunc(&applyEventFunc)
}()
```
waitForEventsUpToLock then blocks waiting to receive on allEventsUpToLockProcessed,
which only fires once the applier has worked through the queue and executed the sentinel.
The bug
Under high DML load the queue backlog is large enough that the sentinel isn't executed
before CutOverLockTimeoutSeconds expires. This is subtle: the logs show the sentinel
being intercepted quickly, which is easy to misread as the sentinel having been
delivered. The log line `Handled changelog state AllEventsUpToLockProcessed` only means
that onChangelogStateEvent spawned the goroutine that pushes onto applyEventsQueue; it does
not mean the applier has executed the sentinel, or that anything has been sent on
allEventsUpToLockProcessed.
From our logs:
```
09:56:49.616 Writing changelog state: AllEventsUpToLockProcessed:177240580961633980
09:56:49.616 Tables locked
09:56:49.620 Intercepted changelog state AllEventsUpToLockProcessed   ← binlog event seen
09:56:49.620 Handled changelog state AllEventsUpToLockProcessed       ← goroutine spawned
09:56:49.623 Waiting for events up to lock
09:56:52.617 Timeout while waiting for events up to lock              ← 3s later, timed out
09:56:52.617 Dropping magic cut-over table
09:56:53.053 Tables unlocked
09:56:53.058 Looking for magic cut-over table                         ← no further attempts
```
waitForEventsUpToLock times out and exits, but the sentinel func is still in the queue.
When the applier eventually reaches it and executes it, it tries to send to
allEventsUpToLockProcessed — an unbuffered channel. The only goroutine that reads from
this channel is waitForEventsUpToLock, which has already exited. The applier blocks
permanently on that send.
On subsequent cutover attempts, we never reach waitForEventsUpToLock at all. The
sleepWhileTrue heartbeat lag guard at the top of cutOver() sees high heartbeat lag —
caused by the blocked applier, which has stopped processing all binlog events including
heartbeats — and spins forever waiting for it to recover. It never does. The migration is
permanently stalled with no error, no restart, and no further progress. This matches the
symptoms in #971.
Fix
Buffer allEventsUpToLockProcessed with capacity MaxRetries(). There are arguably less
wasteful approaches, but this is the simplest: the applier's send always completes
immediately, regardless of whether waitForEventsUpToLock is still listening. Stale
sentinels from timed-out attempts accumulate in the buffer and are discarded by the
existing `state != challenge` skip loop on subsequent calls to waitForEventsUpToLock.