fix(probe,readiness): improve resilience to transient API server connectivity issues by armru · Pull Request #9148 · cloudnative-pg/cloudnative-pg

@dosubot added the size:M label (this PR changes 30-99 lines, ignoring generated files)

Nov 12, 2025

@armru changed the title from "fix(probe,healthy): use the cluster cache when the api-server is unavailable" to "fix(probe,healthy): use the cluster cache when the apiserver is unavailable"

Nov 12, 2025

@armru changed the title from "fix(probe,healthy): use the cluster cache when the apiserver is unavailable" to "fix(probe,healthy): use the local cache when the apiserver is unavailable"

Nov 12, 2025

@dosubot added the size:L label (this PR changes 100-499 lines, ignoring generated files) and removed the size:M label (this PR changes 30-99 lines, ignoring generated files)

Nov 13, 2025

@armru changed the title from "fix(probe,healthy): use the local cache when the apiserver is unavailable" to "fix(probe,readiness): improve probe resilience to transient API server connectivity issues"

Nov 13, 2025

@armru changed the title from "fix(probe,readiness): improve probe resilience to transient API server connectivity issues" to "fix(probe,readiness): improve resilience to transient API server connectivity issues"

Nov 13, 2025
…ilable

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>

@mnencia deleted the dev/readiness-probe branch

November 28, 2025 15:51

cnpg-bot pushed a commit that referenced this pull request

Nov 28, 2025
…ectivity issues (#9148)

This change enhances the resilience of all probe types (liveness,
readiness, and startup) when facing transient Kubernetes API server
connectivity issues. Previously, readiness and startup probes would fail
immediately if unable to reach the API server, potentially causing
unnecessary pod restarts or preventing pods from becoming ready.

The improvement introduces a unified cluster caching mechanism that:

- Creates a **single shared cache** instance used across all three probe
  types (liveness, readiness, startup) to reduce memory usage and ensure
  consistency
- Implements **thread-safe** cache operations with proper mutex locking
  to support concurrent probe execution
- Attempts to fetch the cluster definition with a **500ms timeout** to
  avoid blocking the probe for too long
- **Falls back to a cached cluster definition** if the API server is
  temporarily unreachable
- **Falls back to default probe configuration** if no cached cluster is
  found
- Maintains probe functionality during brief network interruptions or
  API server unavailability
- Uses optimized memory allocation patterns to avoid unnecessary
  `DeepCopy` operations

This ensures consistent behavior across all probe types and reduces
spurious probe failures during transient network issues, while also
improving performance through shared resources and optimized memory usage.
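
The fallback chain described in the list above can be illustrated with a
minimal Go sketch. It is not the code from this PR: `Cluster`, `Fetcher`,
`clusterCache`, `resolve`, `probeOnce`, and the default values are
hypothetical names chosen for the example. Only the 500ms timeout, the
mutex-protected shared cache, and the fallback order (API server, then
cache, then defaults) come from the description.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// Cluster is a hypothetical stand-in for the cluster definition read by the probes.
type Cluster struct {
	Name             string
	FailureThreshold int
}

// Fetcher is a hypothetical function that loads the cluster from the API server.
// It is expected to honor context cancellation.
type Fetcher func(ctx context.Context) (*Cluster, error)

// clusterCache keeps the last successfully fetched cluster.
// One instance is meant to be shared by all probe handlers.
type clusterCache struct {
	mu      sync.RWMutex
	cluster *Cluster
}

func (c *clusterCache) get() *Cluster {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.cluster
}

func (c *clusterCache) set(cl *Cluster) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.cluster = cl
}

// fetchTimeout bounds how long a probe waits for the API server.
const fetchTimeout = 500 * time.Millisecond

// defaultCluster returns assumed default probe settings (values are illustrative).
func defaultCluster() *Cluster {
	return &Cluster{Name: "default", FailureThreshold: 3}
}

// resolve tries the API server first, then the cache, then the defaults.
// The second return value names the source that was used.
func resolve(ctx context.Context, cache *clusterCache, fetch Fetcher) (*Cluster, string) {
	ctx, cancel := context.WithTimeout(ctx, fetchTimeout)
	defer cancel()

	if cl, err := fetch(ctx); err == nil {
		cache.set(cl) // refresh the shared cache on every successful fetch
		return cl, "api"
	}
	if cl := cache.get(); cl != nil {
		return cl, "cache"
	}
	return defaultCluster(), "default"
}

// probeOnce runs a single resolution with a background context.
func probeOnce(cache *clusterCache, fetch Fetcher) (*Cluster, string) {
	return resolve(context.Background(), cache, fetch)
}

// failingFetch simulates an unreachable API server.
func failingFetch(context.Context) (*Cluster, error) {
	return nil, errors.New("api server unreachable")
}

// workingFetch simulates a healthy API server.
func workingFetch(context.Context) (*Cluster, error) {
	return &Cluster{Name: "pg", FailureThreshold: 5}, nil
}

func main() {
	cache := &clusterCache{}
	for _, fetch := range []Fetcher{failingFetch, workingFetch, failingFetch} {
		cl, src := probeOnce(cache, fetch)
		fmt.Printf("%s: %s\n", src, cl.Name)
	}
}
```

In this sketch the first call finds neither a reachable API server nor a
cached cluster and uses the defaults, the second call populates the cache,
and the third call is served from the cache. The timeout only takes effect
if the fetcher stops when its context is cancelled.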

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Co-authored-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
(cherry picked from commit 1f11235)