GCE PD Detach fails if node no longer exists

Problem:

If a node with a GCE PD attached is deleted before the volume is detached, every subsequent attempt by the attach/detach controller to detach the volume fails. Because the detach never succeeds, the controller cannot attach the volume to another node.

Repro steps:

  1. Create a pod referencing a GCE PD.
  2. Wait for the pod to be scheduled and running.
  3. Delete the node VM (using gcloud) that the pod is scheduled on.
  4. Check whether the volume is detached by the attach/detach controller:
    • Expected: Volume is detached.
    • Actual: Detach fails continuously.

Logs:

From `/var/log/kube-controller-manager.log`:
I0721 03:21:43.464087       7 reconciler.go:134] Started DetachVolume for volume "X" from node "Y" due to maxWaitForUnmountDuration expiry.
E0721 03:21:43.591941       7 gce.go:2580] getInstanceByName/single-zone: failed to get instance Y; err: googleapi: Error 404: The resource 'projects/[project]/zones/[zone]/instances/Y' was not found, notFound
E0721 03:21:43.591985       7 attacher.go:215] Error checking if PD ("[pdname]") is already attached to current node ("Y"). Will continue and try detach anyway. err=instance not found
E0721 03:21:43.698786       7 gce.go:2580] getInstanceByName/single-zone: failed to get instance Y; err: googleapi: Error 404: The resource 'projects/[project]/zones/[zone]/instances/Y' was not found, notFound
E0721 03:21:43.698828       7 attacher.go:225] Error detaching PD "[pdname]" from node "Y": error getting instance "Y"

Workarounds:

  • Restart the kube-controller-manager binary on the master.

-or-

  • Recreate a node with the same name.

Proposed Fix:

If a GCE PD detach fails with "instance not found", assume the detach succeeded: an instance that no longer exists cannot have the disk attached.