bug: stale sandbox name reservations block pod creation after node recovery

Describe the bug

After a node is disrupted (e.g. by DRBD/kube-ovn problems during an upgrade), containerd accumulates stale sandbox name reservations in its in-memory CRI index. These reservations permanently block new pods from being created on the affected node; sandbox creation fails with:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to reserve sandbox name "<pod>_<namespace>_<uid>_0": name "<pod>_<namespace>_<uid>_0" is reserved for "<old-sandbox-id>"

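To make the failure mode concrete, here is a minimal sketch of a name reservation index that produces the same kind of error. It is a toy model, not containerd's code: `NameRegistrar`, `reserve`, `release_by_id` and `sandbox_name` are hypothetical names chosen for illustration, and the name format is taken from the error message above.

```python
class NameRegistrar:
    """Toy bidirectional name <-> sandbox-ID index (hypothetical, not containerd code)."""

    def __init__(self):
        self._name_to_id = {}
        self._id_to_name = {}

    def reserve(self, name, sandbox_id):
        # A name can be held by exactly one sandbox ID at a time.
        owner = self._name_to_id.get(name)
        if owner is not None and owner != sandbox_id:
            raise RuntimeError(f'name "{name}" is reserved for "{owner}"')
        # A sandbox ID can hold exactly one name at a time.
        held = self._id_to_name.get(sandbox_id)
        if held is not None and held != name:
            raise RuntimeError(f'id "{sandbox_id}" already holds "{held}"')
        self._name_to_id[name] = sandbox_id
        self._id_to_name[sandbox_id] = name

    def release_by_id(self, sandbox_id):
        # Only a successful sandbox removal reaches this point.
        name = self._id_to_name.pop(sandbox_id, None)
        if name is not None:
            self._name_to_id.pop(name, None)


def sandbox_name(pod, namespace, uid, attempt=0):
    """Build a name in the <pod>_<namespace>_<uid>_<attempt> form seen in the error."""
    return f"{pod}_{namespace}_{uid}_{attempt}"


if __name__ == "__main__":
    reg = NameRegistrar()
    name = sandbox_name("web-0", "default", "1234")  # made-up pod identity
    reg.reserve(name, "old-sandbox-id")
    try:
        reg.reserve(name, "new-sandbox-id")
    except RuntimeError as err:
        print(err)
```

In this model the second `reserve` call keeps failing until `release_by_id("old-sandbox-id")` runs, which is the step that never happens on the affected node.
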
The stale sandboxes cannot be cleaned up because:

  1. crictl rmp triggers CNI DEL, which calls multus → kube-ovn
  2. Multus has no CNI cache for these sandboxes (they never completed network setup)
  3. Without a cache entry, multus passes an empty config to kube-ovn (server_socket is missing)
  4. kube-ovn rejects the invalid config → DEL fails → sandbox cannot be removed → name reservation persists

This creates a deadlock: sandbox cleanup requires successful CNI DEL, but CNI DEL fails because the sandbox never got a network in the first place.

Additionally, the constant DEL retry storm overloads the multus daemon, causing new ADD requests to time out with:

CmdAdd (shim): timed out waiting for the condition

Environment

  • Cozystack version: 1.2.0
  • Provider: on-prem (Talos Linux v1.12.1, containerd 2.1.6)
  • CNI: multus + kube-ovn + cilium

To Reproduce

  1. Have a node experience a disruption where containerd restarts or pods fail to start (e.g. storage/network issues during upgrade)
  2. Pods that were in the process of being created leave stale sandbox entries
  3. After recovery, new pods scheduled on the same node cannot start due to name reservation conflicts
  4. crictl rmp --force cannot clean them up due to CNI DEL failures
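
For diagnosis, the reserved names can be reconstructed from the list of leftover sandboxes. The sketch below assumes the JSON comes from something like `crictl pods --state NotReady -o json` and that each entry carries `id` and `metadata.{name,namespace,uid,attempt}`, following the CRI PodSandbox message; treat the field layout as an assumption and check it against real output. `reserved_names` is a hypothetical helper.

```python
import json


def reserved_names(crictl_json):
    """Map each reconstructed reservation name to the sandbox ID that holds it."""
    result = {}
    for item in json.loads(crictl_json).get("items") or []:
        meta = item.get("metadata") or {}
        # Same <pod>_<namespace>_<uid>_<attempt> form as in the error message.
        name = "_".join([
            meta.get("name", ""),
            meta.get("namespace", ""),
            meta.get("uid", ""),
            str(meta.get("attempt", 0)),
        ])
        result[name] = item.get("id", "")
    return result
```

Comparing the keys of the returned mapping with the name in the "is reserved for" error shows which old sandbox ID is blocking a given pod.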

Expected behaviour

  • Stale sandbox name reservations should be cleaned up automatically after node recovery
  • CNI DEL for sandboxes that never completed network setup should be a no-op (return success)

Actual behaviour

  • Stale name reservations persist indefinitely in containerd's in-memory CRI index
  • CNI DEL fails for sandboxes without cached network config

Upstream issues

containerd:

multus:

Additional context

The root cause is a design gap between containerd and multus:

  • containerd should not call CNI DEL for sandboxes where ADD never completed
  • multus should gracefully handle DEL when no CNI cache exists (return success instead of failing)
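
The second point can be sketched as a small simulation. This is not multus or kube-ovn code: `delegate_del`, `strict_del`, `tolerant_del` and `remove_sandbox` are hypothetical stand-ins that only model the dependency chain described in this report (the name is released only after CNI DEL succeeds, and DEL fails on a cache miss).

```python
def delegate_del(conf):
    """Stand-in for the delegate plugin: rejects a config without server_socket."""
    if "server_socket" not in conf:
        raise ValueError("invalid config: server_socket missing")


def strict_del(cache, sandbox_id):
    """Reported behaviour: a cache miss turns into an empty config, so DEL fails."""
    delegate_del(cache.get(sandbox_id, {}))
    cache.pop(sandbox_id, None)


def tolerant_del(cache, sandbox_id):
    """Proposed behaviour: no cached config means ADD never completed, so DEL is a no-op."""
    conf = cache.get(sandbox_id)
    if conf is None:
        return
    delegate_del(conf)
    del cache[sandbox_id]


def remove_sandbox(reservations, sandbox_id, cni_del):
    """Release the sandbox's name only after CNI DEL returns without error."""
    cni_del(sandbox_id)
    for name, owner in list(reservations.items()):
        if owner == sandbox_id:
            del reservations[name]
```

With an empty cache, `remove_sandbox(res, "old", lambda sid: strict_del(cache, sid))` raises and leaves the reservation in place, while the same call with `tolerant_del` returns and frees the name.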

Both projects have known open issues, but no fix has been implemented yet.

Checklist

  • I have checked the documentation
  • I have searched for similar issues
  • I have included all required information
  • I have provided clear steps to reproduce
  • I have included relevant logs