bug: stale sandbox name reservations block pod creation after node recovery
Describe the bug
After a node experiences issues (e.g. DRBD/kube-ovn problems during upgrade), containerd accumulates stale sandbox name reservations in its in-memory CRI index. These reservations permanently block new pods from being created on the affected node with the error:
```
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to reserve sandbox name
"<pod>_<namespace>_<uid>_0": name "<pod>_<namespace>_<uid>_0" is reserved for "<old-sandbox-id>"
```
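For triage, the sandbox ID holding the reservation can be extracted from the error text with standard tools. A minimal sketch (the sample message and IDs below are illustrative, not taken from a real node):

```shell
# Sample error line as emitted by containerd's CRI plugin (illustrative values).
err='failed to reserve sandbox name "mypod_default_3f2a_0": name "mypod_default_3f2a_0" is reserved for "9d423d30a817"'

# The offending sandbox ID is the last quoted field after "is reserved for".
sandbox_id=$(printf '%s\n' "$err" | sed -n 's/.*is reserved for "\([^"]*\)".*/\1/p')
echo "$sandbox_id"   # → 9d423d30a817
```

The same pattern works against kubelet events or `journalctl` output to collect all stale sandbox IDs on a node at once.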
The stale sandboxes cannot be cleaned up because:
- `crictl rmp` triggers CNI DEL, which calls multus → kube-ovn
- Multus has no CNI cache for these sandboxes (they never completed network setup)
- Without cache, multus passes an empty config to kube-ovn (`server_socket` missing)
- kube-ovn rejects the invalid config → DEL fails → sandbox cannot be removed → name reservation persists
This creates a deadlock: sandbox cleanup requires successful CNI DEL, but CNI DEL fails because the sandbox never got a network in the first place.
Additionally, the constant DEL retry storm overloads the multus daemon, causing new ADD requests to timeout with:
```
CmdAdd (shim): timed out waiting for the condition
```
Environment
- Cozystack version: 1.2.0
- Provider: on-prem (Talos Linux v1.12.1, containerd 2.1.6)
- CNI: multus + kube-ovn + cilium (chained)
To Reproduce
- Have a node experience a disruption where containerd restarts or pods fail to start (e.g. storage/network issues during upgrade)
- Pods that were in the process of being created leave stale sandbox entries
- After recovery, new pods scheduled on the same node cannot start due to name reservation conflicts
- `crictl rmp --force` cannot clean them up due to CNI DEL failures
Expected behaviour
- Stale sandbox name reservations should be cleaned up automatically after node recovery
- CNI DEL for sandboxes that never completed network setup should be a no-op (return success)
Actual behaviour
- Stale name reservations persist indefinitely in containerd's in-memory CRI index
- CNI DEL fails for sandboxes without cached network config
Upstream issues
containerd:
- Fix Plan: failed to recover state: failed to reserve container name xxx: name xxx is reserved for xxx containerd/containerd#11504 — Fix plan for "failed to reserve container name" (open, not yet implemented)
- After my machine was powered on, all pods did not wake up and reported the error "failed to reserve sandbox name \" xxxx \ ": name \" xxx \ "is reserved for \" 9d423d30a81701807be0b8664a85380287b605ddd915c4fa8ae195801b25ba9e \ "" containerd/containerd#10842 — Same issue after power cycle (open, stale)
- RunPodSandBox times out (4 min ttl in kubelet) and when kubelet tries to recreate the pod it conflicts. containerd/containerd#12438 — RunPodSandBox timeout causes name reservation conflicts (closed)
multus:
- one pod failed k8snetworkplumbingwg/multus-cni#1454 — Same conflistDel error with calico (open)
- Multus: failed to get the cached delegates file k8snetworkplumbingwg/multus-cni#1446 — "failed to get the cached delegates file" on DEL (closed/stale)
- [MULTUS] Pod sandbox changed, post deletion of pod it doesnot come back up k8snetworkplumbingwg/multus-cni#1448 — Pod sandbox changed, pod doesn't come back up (closed/stale)
Additional context
The root cause is a design gap between containerd and multus:
- containerd should not call CNI DEL for sandboxes where ADD never completed
- multus should gracefully handle DEL when no CNI cache exists (return success instead of failing)
Both sides have known open upstream issues (linked above), but no fix has been implemented yet on either side.
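The second point can be sketched as follows. This is not multus's actual code; the cache directory, file layout, and function name are assumptions used only to illustrate the desired DEL semantics:

```shell
# Sketch: DEL handler that treats a missing per-sandbox cache file as proof
# that ADD never completed, and therefore succeeds as a no-op instead of
# forwarding an empty config to the delegate CNI.
cni_del() {
    cache_dir="$1"      # hypothetical multus cache directory
    container_id="$2"   # sandbox/container ID from the CNI request
    cache_file="${cache_dir}/${container_id}"

    if [ ! -f "$cache_file" ]; then
        # No cached delegate config: there is nothing to tear down.
        # Returning success lets containerd release the name reservation.
        echo "no cache for ${container_id}; DEL is a no-op"
        return 0
    fi

    # Cache present: forward the saved delegate config (stubbed here).
    echo "DEL using cached config at ${cache_file}"
}

# A sandbox that never finished ADD has no cache entry:
cni_del /var/lib/cni/multus stale-sandbox-id
```

With this behaviour, the retry storm also disappears: containerd's DEL succeeds on the first attempt, the reservation is dropped, and the multus daemon is no longer saturated by failing DELs.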
Checklist
- I have checked the documentation
- I have searched for similar issues
- I have included all required information
- I have provided clear steps to reproduce
- I have included relevant logs