bug: reaper discards child exit status, masking agent crash root cause
Problem
Container restarts are occurring with termination reason "Completed" and exit code 0, despite the agent process likely crashing. This makes root cause analysis impossible.
Observed Behavior
- Container restarts correlate with "yamux: Failed to read header: failed to read frame header: EOF" on the server
- All restarts show exit code 0 with "Completed", regardless of actual cause
- No OOM pattern (memory usage varied from 9% to 89% at crash time)
- No server-side disconnect or timeout logs precede the crashes
Root Cause Analysis
The agent runs as a child of a reaper process (PID 1). In agent/reaper/reaper_unix.go, when the child exits, the reaper calls Wait4 but discards the exit status entirely and exits 0. This masks:
- Panic crashes (exit code 2)
- SIGKILL from cgroup limits
- Any other termination signal
Additionally, there is no recover() in the agent's production code — any goroutine panic crashes the process immediately with no captured output.
Proposed Fixes
- Reaper logging: Patch the reaper to log the child's exit status (wstatus from Wait4) before exiting. This would immediately reveal on the next crash whether it's a panic, SIGKILL, or other cause.
- Panic recovery: Ensure every goroutine has a deferred recover() declared at the top of its function. This would catch panics, log the stack trace, and prevent silent crashes that are impossible to diagnose.
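A rough sketch of the panic recovery idea, assuming a small wrapper is acceptable. runRecovered is a hypothetical name, not something that exists in the agent today. Note that recover() only has an effect when called directly inside a deferred function on the panicking goroutine.

```go
package main

import (
	"fmt"
	"log"
	"runtime/debug"
	"sync"
)

// runRecovered (hypothetical helper) runs fn and converts a panic into a
// logged error with a stack trace instead of letting it kill the process.
func runRecovered(name string, fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("goroutine %q panicked: %v", name, r)
			// debug.Stack() is captured while the panicking frames are
			// still on the stack, so the trace points at the panic site.
			log.Printf("%v\n%s", err, debug.Stack())
		}
	}()
	fn()
	return nil
}

func main() {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		_ = runRecovered("worker", func() { panic("boom") })
	}()
	wg.Wait()
	log.Println("process still running after worker panic")
}
```

Whether to keep running after a recovered panic is a separate decision. Logging the stack and then exiting with a non-zero code may be safer than continuing with possibly inconsistent state, and it still gives the reaper a meaningful status to report.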
Environment
- Kubernetes deployment with singleProcessOOMKill enabled
- 220GB memory limit