Prevent permanent worker deadlock when cutover times out waiting for binlog sentinel by VarunChandola · Pull Request #1637 · github/gh-ost
Fixes #971.
We recently hit a case where gh-ost stalled permanently mid-cutover: heartbeat lag grew
without bound and no further cutover attempts were made, even though the process was still running.
Background
As part of atomic cutover, gh-ost locks the original table, waits for all in-flight DML
to be applied to the ghost table, then executes the rename. The "wait for in-flight DML"
step works via a sentinel value written to the changelog table. This sentinel travels via
the binlog stream back to gh-ost, which applies binlog events sequentially through a
single-threaded applyEventsQueue. The sentinel is enqueued onto that same queue, so by
the time the applier executes it, all preceding DML is guaranteed to have been applied:
```go
// onChangelogStateEvent — spawns a goroutine to enqueue the sentinel
var applyEventFunc tableWriteFunc = func() error {
	this.allEventsUpToLockProcessed <- &lockProcessedStruct{...}
	return nil
}
go func() {
	this.applyEventsQueue <- newApplyEventStructByFunc(&applyEventFunc)
}()
```
waitForEventsUpToLock then blocks waiting to receive on allEventsUpToLockProcessed,
which only fires once the applier has worked through the queue and executed the sentinel.
The bug
Under high DML load the queue backlog is large enough that the sentinel isn't executed
before CutOverLockTimeoutSeconds expires. This is subtle: the logs show the sentinel
being intercepted quickly, which is easy to misread as the sentinel having been
delivered. The log line `Handled changelog state AllEventsUpToLockProcessed` only means
that onChangelogStateEvent spawned the goroutine that pushes onto applyEventsQueue; it does
not mean the applier has executed the sentinel, or that anything has been sent on
allEventsUpToLockProcessed.
From our logs:
```
09:56:49.616 Writing changelog state: AllEventsUpToLockProcessed:177240580961633980
09:56:49.616 Tables locked
09:56:49.620 Intercepted changelog state AllEventsUpToLockProcessed   ← binlog event seen
09:56:49.620 Handled changelog state AllEventsUpToLockProcessed       ← goroutine spawned
09:56:49.623 Waiting for events up to lock
09:56:52.617 Timeout while waiting for events up to lock              ← 3s later, timed out
09:56:52.617 Dropping magic cut-over table
09:56:53.053 Tables unlocked
09:56:53.058 Looking for magic cut-over table                         ← no further attempts
```
waitForEventsUpToLock times out and exits, but the sentinel func is still in the queue.
When the applier eventually reaches it and executes it, it tries to send to
allEventsUpToLockProcessed — an unbuffered channel. The only goroutine that reads from
this channel is waitForEventsUpToLock, which has already exited. The applier blocks
permanently on that send.
On subsequent cutover attempts, we never reach waitForEventsUpToLock at all. The
sleepWhileTrue heartbeat lag guard at the top of cutOver() sees high heartbeat lag —
caused by the blocked applier, which has stopped processing all binlog events including
heartbeats — and spins forever waiting for it to recover. It never does. The migration is
permanently stalled with no error, no restart, and no further progress. This matches the
symptoms in #971.
Fix
Buffer allEventsUpToLockProcessed with capacity MaxRetries(). There are arguably less
wasteful approaches, but this is the simplest: the applier's send always completes
immediately, regardless of whether waitForEventsUpToLock is still listening. Stale
sentinels from timed-out attempts accumulate in the buffer and are discarded by the
existing `state != challenge` skip loop on subsequent calls to waitForEventsUpToLock.