fix(tailnet): retry after transport dial timeouts by ethanndickson · Pull Request #22977 · coder/coder

This fixes the desktop reconnect bug where the vpn daemon can stay
"connected" after a VPN or Wi-Fi transition but silently stop talking
to coderd until the user toggles Coder Connect off and on.

Problem

The tailnet controller owns a long-lived retry loop in Controller.Run.
That loop already had an important graceful-shutdown guard added in
November 2024 to prevent a phantom redial after cancellation:

    if c.ctx.Err() != nil { return }

That guard was correct. It made controller lifetime depend on the
controller's own context rather than on retry timing races.

But the post-dial error path had since grown a broader terminal check:

    if xerrors.Is(err, context.Canceled) ||
       xerrors.Is(err, context.DeadlineExceeded) {
        return
    }

That turns out to be too broad for desktop reconnects. A dial attempt can
fail with a wrapped context.DeadlineExceeded even while the controller's
own context is still live.

Why that happens

The workspace tailnet dialer uses the SDK HTTP client, which inherits
http.DefaultTransport. That transport uses a net.Dialer with a 30s
Timeout. Go implements that timeout by creating an internal
deadline-bound sub-context for the TCP connect.

So during a control-plane blackhole, the reconnect path can look like
this:

- the existing control-plane connection dies,
- Controller.Run re-enters the retry path,
- the next websocket/TCP dial hangs against unreachable coderd,
- net.Dialer times out the connect after ~30s,
- the returned error unwraps to context.DeadlineExceeded,
- Controller.Run treats that as terminal and returns,
- the retry goroutine exits forever even though c.ctx is still alive.

At that point the data plane can remain partially alive, the desktop app
can still look online, and unblocking coderd does nothing because the
process is no longer trying to redial.

How this was found

We reproduced the issue in the macOS vpn-daemon process with temporary
diagnostics, blackholed coderd with pfctl, and captured multiple
goroutine dumps while the daemon was wedged.

Those dumps showed:

- manageGracefulTimeout was still blocked on <-c.ctx.Done(), proving the
  controller context was not canceled,
- the Controller.Run retry goroutine was missing from later dumps,
- control-plane consumers stayed idle longer over time,
- and once coderd became reachable again the daemon still did not dial
  it.

That narrowed the failure from "slow retry" to "retry loop exited", and
tracing the dial path back through http.DefaultTransport and net.Dialer
explained why a transport timeout was being mistaken for controller
shutdown.

Fix

Keep the November 2024 graceful-shutdown guard exactly as-is, but narrow
the post-dial exit condition so it keys off the controller's own context
instead of the error unwrap chain.

Before:

    if xerrors.Is(err, context.Canceled) ||
       xerrors.Is(err, context.DeadlineExceeded) {
        return
    }

After:

    if c.ctx.Err() != nil {
        return
    }

Why this is the right behavior

This preserves the original graceful-shutdown invariant from the 2024
fix while restoring retryability for transient transport failures:

- if c.ctx is canceled before dialing, the pre-dial guard still prevents
  a phantom redial,
- if c.ctx is canceled during a dial attempt, the error path still exits
  cleanly because c.ctx.Err() is non-nil,
- if a live controller hits a wrapped transport timeout, the loop no
  longer dies and instead retries as intended.

In other words, controller state remains the only authoritative signal
for loop shutdown.

Non-regression coverage

This also preserves the earlier flaky-test fix from ba21ba8:

- pipeDialer still returns errors instead of asserting from background
  goroutines,
- TestController_Disconnects still waits for uut.Closed() before the
  test exits.

On top of that, this change adds focused controller tests that assert:

- a wrapped net.OpError(context.DeadlineExceeded) under a live
  controller causes another dial attempt instead of closing the
  controller, and
- cancellation still shuts the controller down without an extra redial.

Validation

Ran focused validation in the touched package:

- go test ./tailnet -run 'TestController_(Disconnects|RetriesWrappedDeadlineExceeded|DoesNotRedialAfterCancel)$'
- go test ./tailnet

Manual validation also matched the intended behavior: after blocking
coderd long enough to force the controller into the retry path,
unblocking coderd allowed the daemon to recover on its own without
toggling Coder Connect.