fix(probe,readiness): improve resilience to transient API server connectivity issues by armru · Pull Request #9148 · cloudnative-pg/cloudnative-pg
Nov 12, 2025
armru
changed the title
fix(probe,healthy): use the cluster cache when the api-server is unavailable
fix(probe,healthy): use the cluster cache when the apiserver is unavailable
armru
changed the title
fix(probe,healthy): use the cluster cache when the apiserver is unavailable
fix(probe,healthy): use the local cache when the apiserver is unavailable
removed the size:M label
Nov 13, 2025
armru
changed the title
fix(probe,healthy): use the local cache when the apiserver is unavailable
fix(probe,readiness): improve probe resilience to transient API server connectivity issues
armru
changed the title
fix(probe,readiness): improve probe resilience to transient API server connectivity issues
fix(probe,readiness): improve resilience to transient API server connectivity issues
mnencia
deleted the dev/readiness-probe branch
cnpg-bot pushed a commit that referenced this pull request
Nov 28, 2025
…ectivity issues (#9148)

This change enhances the resilience of all probe types (liveness, readiness, and startup) when facing transient Kubernetes API server connectivity issues. Previously, readiness and startup probes would fail immediately if unable to reach the API server, potentially causing unnecessary pod restarts or preventing pods from becoming ready.

The improvement introduces a unified cluster caching mechanism that:

- Creates a **single shared cache** instance used across all three probe types (liveness, readiness, startup) to reduce memory usage and ensure consistency
- Implements **thread-safe** cache operations with proper mutex locking to support concurrent probe execution
- Attempts to fetch the cluster definition with a **500ms timeout** to avoid blocking the probe for too long
- **Falls back to a cached cluster definition** if the API server is temporarily unreachable
- **Falls back to default probe configuration** if no cached cluster is found
- Maintains probe functionality during brief network interruptions or API server unavailability
- Uses optimized memory allocation patterns to avoid unnecessary `DeepCopy` operations

This ensures consistent behavior across all probe types and reduces false positives during transient network issues, while also improving performance through shared resources and optimized memory usage.

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Co-authored-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
(cherry picked from commit 1f11235)