GCE PD Detach fails if node no longer exists
Problem:
If a node with a GCE PD attached is deleted before the volume is detached, every subsequent attempt by the attach/detach controller to detach the volume fails. The repeated failures also prevent the controller from attaching the volume to another node.
Repro steps:
- Create a pod referencing a GCE PD
- Wait for pod to get scheduled and running.
- Delete the node VM (using gcloud) that the pod is scheduled to.
- Check if volume is detached by attach/detach controller:
  - Expected: Volume detached.
  - Actual: Volume continuously fails detach.
Logs:
From `/var/log/kube-controller-manager.log`:
I0721 03:21:43.464087 7 reconciler.go:134] Started DetachVolume for volume "X" from node "Y" due to maxWaitForUnmountDuration expiry.
E0721 03:21:43.591941 7 gce.go:2580] getInstanceByName/single-zone: failed to get instance Y; err: googleapi: Error 404: The resource 'projects/[project]/zones/[zone]/instances/Y' was not found, notFound
E0721 03:21:43.591985 7 attacher.go:215] Error checking if PD ("[pdname]") is already attached to current node ("Y"). Will continue and try detach anyway. err=instance not found
E0721 03:21:43.698786 7 gce.go:2580] getInstanceByName/single-zone: failed to get instance Y; err: googleapi: Error 404: The resource 'projects/[project]/zones/[zone]/instances/Y' was not found, notFound
E0721 03:21:43.698828 7 attacher.go:225] Error detaching PD "[pdname]" from node "Y": error getting instance "Y"
Workarounds:
- Restart the kube-controller-manager binary on the master.
-or-
- Recreate a node with the same name.
Proposed Fix:
If GCE PD detach fails with instance not found, assume the detach succeeded: a VM that no longer exists cannot still have the disk attached.
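
A rough sketch of what the proposed fix could look like, for discussion only. This is not the actual attacher code: `ErrInstanceNotFound`, `diskDetacher`, `detachTolerantOfMissingNode`, and `fakeDetacher` are hypothetical names made up for this illustration, and the real cloud provider may surface the "instance not found" condition differently.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrInstanceNotFound is a hypothetical sentinel error standing in for
// whatever the cloud provider returns when the VM no longer exists.
var ErrInstanceNotFound = errors.New("instance not found")

// errTransient is a hypothetical unrelated failure, used only in the demo.
var errTransient = errors.New("transient API error")

// diskDetacher is a hypothetical, minimal view of the cloud provider.
type diskDetacher interface {
	DetachDisk(pdName, nodeName string) error
}

// detachTolerantOfMissingNode detaches pdName from nodeName. If the
// provider reports that the instance does not exist, the detach is
// treated as successful, since a deleted VM cannot hold the disk.
// Any other error is returned unchanged so the caller can retry.
func detachTolerantOfMissingNode(d diskDetacher, pdName, nodeName string) error {
	err := d.DetachDisk(pdName, nodeName)
	if errors.Is(err, ErrInstanceNotFound) {
		return nil
	}
	return err
}

// fakeDetacher is a test double that always returns the configured error.
type fakeDetacher struct{ err error }

func (f fakeDetacher) DetachDisk(pdName, nodeName string) error { return f.err }

func main() {
	// Node was deleted: the wrapped not-found error is swallowed.
	gone := fakeDetacher{err: fmt.Errorf("error getting instance %q: %w", "Y", ErrInstanceNotFound)}
	fmt.Println("deleted node:", detachTolerantOfMissingNode(gone, "pdname", "Y"))

	// Any other failure is still reported, so the detach is retried.
	flaky := fakeDetacher{err: errTransient}
	fmt.Println("other failure:", detachTolerantOfMissingNode(flaky, "pdname", "Y"))
}
```

Using `errors.Is` means the check also matches when the not-found error has been wrapped with `%w` on the way up. Only the not-found case is swallowed, so other detach failures keep their current retry behavior.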