fix(tailnet): retry after transport dial timeouts by ethanndickson · Pull Request #22977 · coder/coder
This fixes the desktop reconnect bug where the vpn daemon can stay
"connected" after a VPN or Wi-Fi transition but silently stop talking
to coderd until the user toggles Coder Connect off and on.
Problem
The tailnet controller owns a long-lived retry loop in Controller.Run.
That loop already had an important graceful-shutdown guard added in
November 2024 to prevent a phantom redial after cancellation:
```go
if c.ctx.Err() != nil { return }
```
That guard was correct. It made controller lifetime depend on the
controller's own context rather than on retry timing races.
But the post-dial error path had since grown a broader terminal check:
```go
if xerrors.Is(err, context.Canceled) ||
	xerrors.Is(err, context.DeadlineExceeded) {
	return
}
```
That turns out to be too broad for desktop reconnects. A dial attempt can
fail with a wrapped context.DeadlineExceeded even while the controller's
own context is still live.
Why that happens
The workspace tailnet dialer uses the SDK HTTP client, which inherits
http.DefaultTransport. That transport uses a net.Dialer with a 30s
Timeout. Go implements that timeout by creating an internal deadline-
bound sub-context for the TCP connect.
So during a control-plane blackhole, the reconnect path can look like
this:
- the existing control-plane connection dies,
- Controller.Run re-enters the retry path,
- the next websocket/TCP dial hangs against unreachable coderd,
- net.Dialer times out the connect after ~30s,
- the returned error unwraps to context.DeadlineExceeded,
- Controller.Run treats that as terminal and returns,
- the retry goroutine exits forever even though c.ctx is still alive.
At that point the data plane can remain partially alive, the desktop app
can still look online, and unblocking coderd does nothing because the
process is no longer trying to redial.
How this was found
We reproduced the issue in the macOS vpn-daemon process with temporary
diagnostics, blackholed coderd with pfctl, and captured multiple
goroutine dumps while the daemon was wedged.
Those dumps showed:
- manageGracefulTimeout was still blocked on <-c.ctx.Done(), proving the
controller context was not canceled,
- the Controller.Run retry goroutine was missing from later dumps,
- control-plane consumers stayed idle longer over time,
- and once coderd became reachable again the daemon still did not dial
it.
That narrowed the failure from "slow retry" to "retry loop exited", and
tracing the dial path back through http.DefaultTransport and net.Dialer
explained why a transport timeout was being mistaken for controller
shutdown.
Fix
Keep the November 2024 graceful-shutdown guard exactly as-is, but narrow
the post-dial exit condition so it keys off the controller's own context
instead of the error unwrap chain.
Before:
```go
if xerrors.Is(err, context.Canceled) ||
	xerrors.Is(err, context.DeadlineExceeded) {
	return
}
```
After:
```go
if c.ctx.Err() != nil {
	return
}
```
Why this is the right behavior
This preserves the original graceful-shutdown invariant from the 2024
fix while restoring retryability for transient transport failures:
- if c.ctx is canceled before dialing, the pre-dial guard still prevents
a phantom redial,
- if c.ctx is canceled during a dial attempt, the error path still exits
cleanly because c.ctx.Err() is non-nil,
- if a live controller hits a wrapped transport timeout, the loop no
longer dies and instead retries as intended.
In other words, controller state remains the only authoritative signal
for loop shutdown.
Non-regression coverage
This also preserves the earlier flaky-test fix from ba21ba8:
- pipeDialer still returns errors instead of asserting from background
goroutines,
- TestController_Disconnects still waits for uut.Closed() before the
test exits.
On top of that, this change adds focused controller tests that assert:
- a net.OpError wrapping context.DeadlineExceeded under a live
controller causes another dial attempt instead of closing the
controller, and
- cancellation still shuts the controller down without an extra redial.
Validation
Ran focused validation in the touched package:
```shell
go test ./tailnet -run 'TestController_(Disconnects|RetriesWrappedDeadlineExceeded|DoesNotRedialAfterCancel)$'
go test ./tailnet
```
Manual validation also matched the intended behavior: after blocking
coderd long enough to force the controller into the retry path,
unblocking coderd allowed the daemon to recover on its own without
toggling Coder Connect.