# CloudNativePG Chaos Testing with Jepsen
Production-ready Jepsen and Litmus chaos automation for CloudNativePG (CNPG) clusters.
## 🚀 Quick Start
Want to run chaos testing immediately? Follow these streamlined steps:
- Clone this repo → Get the chaos experiments and scripts (section 0)
- Setup cluster → Bootstrap CNPG Playground (section 1)
- Install CNPG → Deploy operator + sample cluster (section 2)
- Install Litmus → Install operator, experiments, and RBAC (sections 3, 3.5, 3.6)
- Smoke-test chaos → Run the quick pod-delete check without monitoring (section 4)
- Add monitoring → Install Prometheus for probe validation (section 5; required before section 6 with probes enabled)
- Run Jepsen → Full consistency testing layered on chaos (section 6)
First-time users: Use section 4 as a smoke test without Prometheus, then return to section 5 to install monitoring before running the Jepsen workflow in section 6.
## ✅ Prerequisites
- Linux/macOS shell with `bash`, `git`, `curl`, `jq`, and internet access.
- Container + Kubernetes tooling: Docker or Podman, the Kind CLI, `kubectl`, `helm`, the `kubectl cnpg` plugin binary, and the `cmctl` utility for cert-manager.
- Install the CNPG plugin using kubectl krew (recommended):
  ```bash
  # Install or update to the latest version
  kubectl krew update
  kubectl krew install cnpg || kubectl krew upgrade cnpg
  kubectl cnpg version
  ```
  Alternative installation methods:
  - For Debian/Ubuntu: download the `.deb` package from the releases page
  - For RHEL/Fedora: download the `.rpm` package from the releases page
  - See the official installation docs for all methods
- Optional but recommended: `kubectx`, `stern`, `kubectl-view-secret` (see the CNPG Playground README for a complete list).
- Disk space: a minimum of 30 GB free is recommended:
  - Kind cluster nodes: ~5 GB
  - Container images: ~5 GB (first run with image pull)
  - Prometheus/MongoDB storage: ~10 GB
  - Jepsen results + logs: ~5 GB
  - Buffer for growth: ~5 GB
- Sufficient local resources for a multi-node Kind cluster (≈8 CPUs / 12 GB RAM) and permission to run port-forwards.
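Before moving on, a quick sanity check can catch missing tooling early. A minimal sketch, assuming the CLI names listed above (skip any tools you don't plan to use):

```bash
# Verify the required CLIs are on PATH; names follow the prerequisites list above
for tool in git curl jq kind kubectl helm cmctl; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done
kubectl krew version >/dev/null 2>&1 || echo "missing: krew (needed to install the cnpg plugin)"
df -h .   # confirm roughly 30 GB free for images, PV storage, and logs
```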
Once the tooling is present, everything else is managed via repository scripts and Helm charts.
## ⚡ Setup and Configuration
Follow these sections in order; each references the authoritative upstream documentation to keep this README concise.
### 0. Clone the Chaos Testing Repository
First, clone this repository to access the chaos experiments and scripts:
```bash
git clone https://github.com/cloudnative-pg/chaos-testing.git
cd chaos-testing
```

All subsequent commands reference files in this repository (experiments, scripts, monitoring configs).
### 1. Bootstrap the CNPG Playground
The upstream documentation provides detailed instructions for prerequisites and networking. Follow the setup instructions here: https://github.com/cloudnative-pg/cnpg-playground#usage.
Deploy the cnpg-playground project in a folder parallel to chaos-testing:

```bash
cd ..
git clone https://github.com/cloudnative-pg/cnpg-playground.git
cd cnpg-playground
./scripts/setup.sh eu   # creates the kind-k8s-eu cluster
```
Follow the instructions on the screen. In particular, make sure that you:

- export the `KUBECONFIG` variable, as described
- set the correct context for `kubectl`
For example:

```bash
export KUBECONFIG=<PATH_TO_CNPG_PLAYGROUND>/k8s/kube-config.yaml
kubectl config use-context kind-k8s-eu
```
If unsure, type:

```bash
./scripts/info.sh   # displays contexts and access information
```
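To confirm the bootstrap succeeded before moving on, a short check like this can help (it assumes the playground created a Kind cluster named `k8s-eu`, matching the `kind-k8s-eu` context above):

```bash
kind get clusters                # should include: k8s-eu
kubectl config current-context   # should print: kind-k8s-eu
kubectl get nodes                # all nodes should be Ready
```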
### 2. Install CloudNativePG and Create the PostgreSQL Cluster
With the Kind cluster running, install the operator using the `kubectl cnpg` plugin as recommended in the CloudNativePG Installation & Upgrades guide. This approach ensures you get the latest stable operator version.
In the cnpg-playground folder:
```bash
# Install the latest operator version using the kubectl cnpg plugin
kubectl cnpg install generate --control-plane | \
  kubectl --context kind-k8s-eu apply -f - --server-side

# Verify the controller rollout
kubectl --context kind-k8s-eu rollout status deployment \
  -n cnpg-system cnpg-controller-manager
```
In the chaos-testing folder:
```bash
cd ../chaos-testing

# Create the pg-eu PostgreSQL cluster for chaos testing
kubectl apply -f clusters/pg-eu-cluster.yaml

# Verify the cluster is ready (this will watch until healthy)
kubectl get cluster pg-eu -w
# Wait until the status shows "Cluster in healthy state"
# Press Ctrl+C when you see: pg-eu  3  3  ready  XXm
```
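Once the cluster reports healthy, the `cnpg` plugin gives a richer view than `kubectl get cluster`, including which instance is currently the primary; this is a useful baseline before chaos starts:

```bash
# Detailed cluster status via the CNPG plugin (primary, replicas, replication state)
kubectl cnpg status pg-eu
```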
### 3. Install Litmus Chaos
Litmus 3.x separates the operator (the `litmus-core` chart) from the ChaosCenter UI (the `litmus` chart). This walkthrough only needs the operator; install it, then add the experiment definitions and RBAC:
```bash
# Add the Litmus Helm repository
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update

# Install litmus-core (operator + CRDs)
helm upgrade --install litmus-core litmuschaos/litmus-core \
  --namespace litmus --create-namespace \
  --wait --timeout 10m

# Verify the CRDs are installed
kubectl get crd chaosengines.litmuschaos.io chaosexperiments.litmuschaos.io chaosresults.litmuschaos.io

# Verify the operator is running
kubectl -n litmus get deploy litmus
kubectl -n litmus wait --for=condition=Available deployment/litmus --timeout=5m
```
### 3.5. Install ChaosExperiment Definitions
The ChaosEngine requires ChaosExperiment resources to exist before it can run. Install the pod-delete experiment:
```bash
# Install from the Chaos Hub (the manifest hardcodes namespace: default, so override it)
kubectl apply --namespace=litmus -f "https://hub.litmuschaos.io/api/chaos/master?file=faults/kubernetes/pod-delete/fault.yaml"

# Verify the experiment is installed
kubectl -n litmus get chaosexperiments
# Should show: pod-delete
```
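If you want to see which tunables the fault exposes (such as the `TOTAL_CHAOS_DURATION` and `CHAOS_INTERVAL` values referenced later), you can dump the experiment's default env vars; a sketch:

```bash
# Show the pod-delete fault's default env tunables as JSON
kubectl -n litmus get chaosexperiment pod-delete \
  -o jsonpath='{.spec.definition.env}' | jq
```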
### 3.6. Configure RBAC for Chaos Experiments
Apply the RBAC configuration and verify the service account has correct permissions:
```bash
# Apply RBAC (ServiceAccount, ClusterRole, ClusterRoleBinding)
kubectl apply -f litmus-rbac.yaml

# Verify the ServiceAccount exists in the litmus namespace
kubectl -n litmus get serviceaccount litmus-admin

# Verify the ClusterRoleBinding points to the correct namespace
kubectl get clusterrolebinding litmus-admin -o jsonpath='{.subjects[0].namespace}'
# Should output: litmus (not default)

# Test permissions (optional)
kubectl auth can-i delete pods --as=system:serviceaccount:litmus:litmus-admin -n default
# Should output: yes
```

**Important:** The `litmus-rbac.yaml` ClusterRoleBinding must reference `namespace: litmus` in the subjects section. If you see errors like `"litmus-admin" cannot get resource "chaosengines"`, verify the namespace matches where the ServiceAccount exists.
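One quick way to eyeball the binding is to print its subjects; the expected shape (illustrative) is shown in the comments:

```bash
# Show the subjects of the binding; the namespace must be litmus
kubectl get clusterrolebinding litmus-admin -o yaml | grep -A 4 '^subjects:'
# Expected (illustrative):
# subjects:
# - kind: ServiceAccount
#   name: litmus-admin
#   namespace: litmus
```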
### 4. (Optional) Test Chaos Without Monitoring
Before setting up the full monitoring stack, you can verify chaos mechanics work independently:
```bash
# Apply the probe-free chaos engine (no Prometheus dependency)
kubectl apply -f experiments/cnpg-jepsen-chaos-noprobes.yaml

# Watch the chaos runner pod start (refreshes every 2s)
# Press Ctrl+C once you see the runner pod appear
watch -n2 'kubectl -n litmus get pods | grep cnpg-jepsen-chaos-noprobes-runner'

# Monitor CNPG pod deletions in real time
bash scripts/monitor-cnpg-pods.sh pg-eu default litmus kind-k8s-eu

# Wait for the chaos runner pod to be created, then follow its logs
kubectl -n litmus wait --for=condition=ready pod -l chaos-runner-name=cnpg-jepsen-chaos-noprobes --timeout=60s && \
  runner_pod=$(kubectl -n litmus get pods -l chaos-runner-name=cnpg-jepsen-chaos-noprobes -o jsonpath='{.items[0].metadata.name}') && \
  kubectl -n litmus logs -f "$runner_pod"

# After completion, check the result (the chaosresult name follows the engine name)
kubectl -n litmus get chaosresult cnpg-jepsen-chaos-noprobes-pod-delete -o jsonpath='{.status.experimentStatus.verdict}'
# Should output: Pass (if probes are disabled) or Error (if Prometheus probes are enabled but Prometheus is not installed)

# Clean up for the next test
kubectl -n litmus delete chaosengine cnpg-jepsen-chaos-noprobes
```
What to observe:

- The runner pod starts and creates an experiment pod (`pod-delete-xxxxx`)
- CNPG primary pods are deleted every 60 seconds
- CNPG automatically promotes a replica to primary after each deletion
- Deleted pods are recreated by the CNPG operator (CloudNativePG manages pods directly rather than through a StatefulSet)
- The experiment runs for 10 minutes (`TOTAL_CHAOS_DURATION=600`)
Note: Keep using `experiments/cnpg-jepsen-chaos-noprobes.yaml` until section 5 installs Prometheus/Grafana. Once monitoring is online, switch to `experiments/cnpg-jepsen-chaos.yaml` (probes enabled) for full observability.
### 5. Configure Monitoring (Prometheus + Grafana)
The cnpg-playground provides a built-in monitoring stack with Prometheus and Grafana. From the cnpg-playground directory:
```bash
cd ../cnpg-playground
./monitoring/setup.sh eu
```

This script installs:

- Prometheus Operator (in the `prometheus-operator` namespace)
- Grafana Operator with the official CloudNativePG dashboard (in the `grafana` namespace)
- Auto-configured for the `kind-k8s-eu` cluster
Once installation completes, create the PodMonitor to expose CNPG metrics:
```bash
# Switch back to the chaos-testing directory
cd ../chaos-testing

# Apply the CNPG PodMonitor
kubectl apply -f monitoring/podmonitor-pg-eu.yaml

# Verify the PodMonitor
kubectl get podmonitor pg-eu -o wide

# Verify Prometheus is scraping CNPG metrics
kubectl -n prometheus-operator port-forward svc/prometheus-operated 9090:9090 &
curl -s --data-urlencode 'query=sum(cnpg_collector_up{cluster="pg-eu"})' "http://localhost:9090/api/v1/query"
```
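The raw query response is JSON; as a sketch, you can extract just the value with `jq` (a healthy three-instance cluster should report 3) and stop the background port-forward afterwards:

```bash
# Extract the numeric value from the Prometheus instant-query response
curl -s --data-urlencode 'query=sum(cnpg_collector_up{cluster="pg-eu"})' \
  "http://localhost:9090/api/v1/query" | jq -r '.data.result[0].value[1]'
# Expected output: "3"

kill %1   # stop the backgrounded port-forward (assuming it is still job %1 in this shell)
```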
Access the Grafana dashboard:

```bash
kubectl -n grafana port-forward svc/grafana-service 3000:3000
# Open http://localhost:3000 with:
#   Username: admin
#   Password: admin (you'll be prompted to change it on first login)
```
The official CloudNativePG dashboard is pre-configured and available at: Home → Dashboards → grafana → CloudNativePG
Note: If you recreate the `pg-eu` cluster, reapply the PodMonitor so Prometheus resumes scraping: `kubectl apply -f monitoring/podmonitor-pg-eu.yaml`
✅ Required before section 6 (when probes are enabled): Complete this monitoring setup so the Prometheus probes defined in `experiments/cnpg-jepsen-chaos.yaml` can succeed.
#### Dependency on cnpg-playground
This project relies on cnpg-playground's monitoring implementation. Be aware of the following dependencies:
What we depend on:
- Script: `/path/to/cnpg-playground/monitoring/setup.sh`
- Namespace: `prometheus-operator`
- Service: `prometheus-operated` (created by the Prometheus Operator for the CR named `prometheus`)
- Port: `9090` (Prometheus default)
If cnpg-playground monitoring changes, you may need to update:
- Prometheus endpoint in `experiments/cnpg-jepsen-chaos.yaml` (lines 89, 132, 148)
- Service check in `.github/workflows/chaos-test-full.yml` (line 57)
- Service check in `scripts/run-jepsen-chaos-test.sh` (line 279)
Troubleshooting: If probes fail with connection errors:

```bash
# Verify the Prometheus service exists
kubectl -n prometheus-operator get svc

# If the service name changed, update all probe endpoints
# in experiments/cnpg-jepsen-chaos.yaml
```
### 6. Run the Jepsen Chaos Test
```bash
./scripts/run-jepsen-chaos-test.sh pg-eu app 600
```
This script deploys Jepsen (the `jepsenpg` image), applies the Litmus ChaosEngine (primary pod delete), monitors logs, collects results, and cleans up transient resources automatically (no manual exit needed; the script handles everything).
Prerequisites before running the script:

- Section 5 completed (Prometheus/Grafana running) so probes succeed.
- Chaos workflow validated (run `experiments/cnpg-jepsen-chaos.yaml` once manually if you need to confirm the Litmus + CNPG wiring).
- Docker registry access to pull the `ardentperf/jepsenpg` image (or pre-pull it into the cluster).
- `kubectl` context pointing to the playground cluster with sufficient resources.
- Max open files limit increased if needed (required for Jepsen on some systems); see the sketch after this list. This may need to be configured in your container runtime or Kind cluster configuration if running in a containerized environment.
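A minimal sketch for the last item, assuming a Linux shell (the limit Jepsen actually needs depends on your run):

```bash
ulimit -n          # show the current soft limit for open files
ulimit -n 65536    # raise it for this shell session, if the hard limit allows
```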
Script knobs:

- `LITMUS_NAMESPACE` (default `litmus`): set if you installed Litmus in a different namespace.
- `PROMETHEUS_NAMESPACE` (default `prometheus-operator`): used to auto-detect the Prometheus service backing the Litmus probes.
- `JEPSEN_IMAGE` is pinned to `ardentperf/jepsenpg@sha256:4a3644d9484de3144ad2ea300e1b66568b53d85a87bf12aa64b00661a82311ac` for reproducibility. Update this digest only after verifying upstream releases.
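For example, with a non-default install the knobs can be passed as environment variables (the namespace values below are illustrative):

```bash
# Override the auto-detected namespaces for a non-default installation
LITMUS_NAMESPACE=chaos PROMETHEUS_NAMESPACE=monitoring \
  ./scripts/run-jepsen-chaos-test.sh pg-eu app 600
```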
### 7. Inspect Test Results
- All test results are stored under `logs/jepsen-chaos-<timestamp>/`.
- Quick validation commands:

  ```bash
  # Check the Litmus chaos verdict (note: use -n litmus, not -n default)
  kubectl -n litmus get chaosresult cnpg-jepsen-chaos-pod-delete \
    -o jsonpath='{.status.experimentStatus.verdict}'

  # View the full chaos result details
  kubectl -n litmus get chaosresult cnpg-jepsen-chaos-pod-delete -o yaml

  # Check probe results (if Prometheus was installed)
  kubectl -n litmus get chaosresult cnpg-jepsen-chaos-pod-delete \
    -o jsonpath='{.status.probeStatuses}' | jq
  ```

- Archive `history.edn` and `chaos-results/chaosresult.yaml` for analysis or reporting.
## 📦 Results & Logs
- Each run creates a folder under `logs/jepsen-chaos-<timestamp>/`.
- Key files (see the quick scan after this list):
  - `results/history.edn` → Jepsen operation history.
  - `results/chaos-results/chaosresult.yaml` → Litmus verdict + probe output.
- Quick checks:

  ```bash
  # Chaos results (note: the namespace is 'litmus' by default)
  kubectl -n litmus get chaosresult cnpg-jepsen-chaos-pod-delete \
    -o jsonpath='{.status.experimentStatus.verdict}'
  ```
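As a first pass over the history mentioned above, counting operation outcomes is often enough to spot trouble; a sketch (`:ok`, `:fail`, and `:info` are standard Jepsen operation types; exact checker output depends on the workload):

```bash
# Count completed, failed, and indeterminate operations in the most recent run
latest=$(ls -td logs/jepsen-chaos-*/ | head -1)
grep -c ':type :ok'   "$latest/results/history.edn"
grep -c ':type :fail' "$latest/results/history.edn"
grep -c ':type :info' "$latest/results/history.edn"   # :info usually marks indeterminate ops
```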
## 🔗 References & More Docs
- CNPG Playground: https://github.com/cloudnative-pg/cnpg-playground
- CloudNativePG Installation & Upgrades: https://cloudnative-pg.io/documentation/current/installation_upgrade/
- Litmus Helm chart: https://github.com/litmuschaos/litmus-helm/
- kube-prometheus-stack: https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
- CNPG Grafana dashboards: https://github.com/cloudnative-pg/grafana-dashboards
- License: Apache 2.0 (see `LICENSE`).
## 🔧 Monitoring and Observability Tools
### Real-time Monitoring Script
Watch CNPG pods, chaos engines, and cluster events during experiments:
```bash
# Monitor pod deletions and failovers in real time
bash scripts/monitor-cnpg-pods.sh <cluster-name> <cnpg-namespace> <chaos-namespace> <kube-context>

# Example
bash scripts/monitor-cnpg-pods.sh pg-eu default litmus kind-k8s-eu
```
What it shows:
- CNPG pod status with role labels (primary/replica)
- Active ChaosEngines in the chaos namespace
- Recent Kubernetes events (pod deletions, promotions, etc.)
- Updates every 2 seconds
## 📚 Additional Resources
- CNPG Documentation: https://cloudnative-pg.io/documentation/
- Litmus Documentation: https://docs.litmuschaos.io/
- Jepsen Documentation: https://jepsen.io/
- PostgreSQL High Availability: https://www.postgresql.org/docs/current/high-availability.html
Follow the sections above to execute chaos tests. Review the logs for analysis, and consult the `/archive` directory for additional documentation if needed.
