bug: reaper discards child exit status, masking agent crash root cause

Problem

Containers are restarting with termination reason "Completed" and exit code 0, even though the agent process has likely crashed. Because the real exit status is lost, root cause analysis is impossible.

Observed Behavior

  • Container restarts correlate with this log line on the server: "yamux: Failed to read header: failed to read frame header: EOF"
  • All restarts show exit code 0 with "Completed" — regardless of actual cause
  • No OOM pattern (memory usage varied from 9% to 89% at crash time)
  • No server-side disconnect or timeout logs precede the crashes

Root Cause Analysis

The agent runs as a child of a reaper process (PID 1). In agent/reaper/reaper_unix.go, when the child exits, the reaper calls Wait4 but discards the exit status entirely and exits 0. This masks:

  • Panic crashes (exit code 2)
  • SIGKILL from cgroup limits
  • Any other termination signal

Additionally, there is no recover() in the agent's production code — any goroutine panic crashes the process immediately with no captured output.

Proposed Fixes

  1. Reaper logging: Patch the reaper to log the child's exit status (wstatus from Wait4) before exiting. This would immediately reveal on the next crash whether it's a panic, SIGKILL, or other cause.

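  2. Panic recovery: Ensure every goroutine defers, at the top of its function, a function that calls recover(). recover() only stops a panic when it is called directly inside a deferred function (a bare "defer recover()" has no effect), and it only covers panics raised in that same goroutine. The deferred function would log the panic value and stack trace, so crashes are no longer silent and impossible to diagnose.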
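
A minimal sketch of the panic recovery in fix 2, assuming a small wrapper is acceptable in the agent. runRecovered is a hypothetical name, not an existing function in the codebase, and the standard log package stands in for whatever logger the agent uses.

```go
package main

import (
	"fmt"
	"log"
	"runtime/debug"
)

// runRecovered (hypothetical helper) runs fn and converts a panic into a
// logged stack trace plus an error, instead of letting it kill the process.
func runRecovered(name string, fn func()) (err error) {
	// recover() must be called directly inside the deferred function.
	defer func() {
		if r := recover(); r != nil {
			log.Printf("panic in %s: %v\n%s", name, r, debug.Stack())
			err = fmt.Errorf("panic in %s: %v", name, r)
		}
	}()
	fn()
	return nil
}

func main() {
	done := make(chan error, 1)
	// The wrapper runs inside the goroutine it protects; a recover in
	// main would not see a panic raised in this goroutine.
	go func() {
		done <- runRecovered("worker", func() { panic("boom") })
	}()
	fmt.Println("worker result:", <-done)
}
```

Continuing after a recovered panic can leave shared state inconsistent, so an alternative under the same assumptions is to log the stack trace and then exit with a non-zero code.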

Environment

  • Kubernetes deployment with singleProcessOOMKill enabled
  • 220GB memory limit