Adding CancelDrainTask to ASG termination to close orphaned generated heartbeat from nodes failing to cordon and drain by pshand-1 · Pull Request #1173 · aws/aws-node-termination-handler
Issue
Fixes #1172
Problem Description
The Node Termination Handler has a critical bug in ASG termination event handling that creates orphaned heartbeat goroutines when node drain operations fail.
Current Behavior (Buggy)
When an ASG termination event fails to drain a node:
- PreDrainTask starts a heartbeat goroutine
- cordonAndDrainNode fails to evict pods
- CancelInterruptionEvent removes the event but never stops the heartbeat
- The heartbeat goroutine continues running indefinitely, sending heartbeats every 30 seconds
Impact
- Resource leak: Failed drain attempts create orphaned goroutines
- Cascading failure: Events retry every 20 seconds, creating new orphaned heartbeats each time
- AWS API spam: Orphaned heartbeats continue sending unnecessary API calls
Solution
Implemented a CancelDrainTask mechanism that mirrors the existing PreDrainTask/PostDrainTask pattern to properly terminate heartbeats on drain failures.
Key Changes
pkg/monitor/sqsevent/asg-lifecycle-event.go
- Added cancelHeartbeatCh channel for heartbeat cancellation
- Created CancelDrainTask function to close the cancel channel
- Enhanced SendHeartbeats to listen for cancellation signals
- Added proper logging for heartbeat cancellation events
pkg/interruptionevent/draincordon/handler.go
- Added drain failure detection in the error handling path
- Calls RunCancelDrainTask when drain operations fail and CancelDrainTask exists
- Maintains existing error handling flow while adding cleanup
pkg/monitor/sqsevent/sqs-monitor_test.go
- Added unit tests for CancelDrainTask creation and execution
- Added heartbeat cancellation test to verify proper termination
- Integrated with existing test patterns
Testing
Automated Tests (All Passing)
- ✅ Unit tests (make unit-test)
- ✅ E2E tests (make e2e-test)
- ✅ Compatibility tests (make compatibility-test)
- ✅ License validation (make license-test)
- ✅ Linting (make go-linter)
- ✅ Helm validation (make helm-lint)
- ✅ Spell check (make spellcheck)
Tested on: macOS (ARM64) (also ran make unit-test on Linux x86_64)
Kubernetes Version: 1.30
Manual Validation
Scenario: Deployed NTH in EKS cluster and blocked Kubernetes API calls to simulate drain failures
Before Fix:
- New heartbeats created every 20 seconds
- Heartbeats continue indefinitely (tested over 2+ hours)
- Multiple orphaned goroutines accumulating
After Fix:
2025/06/26 23:15:30 INF Failed to cordon and drain the node, stopping heartbeat asgName=...
- Heartbeat properly terminated on drain failure
- No orphaned goroutines
- Clean event cleanup
Backward Compatibility
- ✅ No breaking changes to existing APIs
- ✅ Maintains existing successful drain flow
- ✅ Only adds cleanup for failure scenarios
- ✅ CancelDrainTask is optional (nil-safe)
Code Implementation
- Follows existing code patterns and conventions
- Comprehensive error handling and logging
- Proper resource cleanup
Possible Reproduction Steps (for verification):
- Create two EKS Clusters
- Deploy NTH on one of them with deleteSqsMsgIfNodeNotFound=false
- Terminate an instance on the other cluster that carries the same tag as the NTH managed tag
- Observe repeated drain failures creating orphaned heartbeats
- With fix: heartbeats properly terminate on failure
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.