fix: retry redis conn in the bg and fail open otherwise by Flo4604 · Pull Request #5377 · unkeyed/unkey

What does this PR do?

Adds resilient middleware engine handling to prevent service failures when Redis is unavailable. The middleware engine now fails closed with a 503 Service Unavailable response when Redis connectivity is lost, and automatically retries the connection in the background with exponential backoff.

Introduces a new ResilientEvaluator wrapper that atomically swaps between unavailable and working engine states. When Redis is configured but fails to connect, the service returns 503 errors instead of crashing, and continues attempting to reconnect until successful.

Adds a new error code EngineUnavailable with appropriate HTTP status mapping and Prometheus metrics for monitoring engine unavailability events.

Fixes #5365

Type of change

  • Enhancement (small improvements)
  • Bug fix (non-breaking change which fixes an issue)
  • Chore (refactoring code, technical debt, workflow improvements)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How should this be tested?

  • Start the sentinel service with a Redis URL configured but the Redis server unavailable
  • Verify requests return 503 with "middleware engine temporarily unavailable" message
  • Start Redis server and verify engine automatically recovers
  • Monitor sentinel_engine_unavailable_total Prometheus metric during unavailability
  • Test with empty Redis URL to ensure pass-through mode still works

Checklist

Required

  • Filled out the "How to test" section in this PR
  • Read Contributing Guide
  • Self-reviewed my own code
  • Commented on my code in hard-to-understand areas
  • Ran pnpm build
  • Ran pnpm fmt
  • Ran make fmt on /go directory
  • Checked for warnings, there are none
  • Removed all console.logs
  • Merged the latest changes from main onto my branch with git pull origin main
  • My changes don't cause any responsiveness issues

Appreciated

  • If a UI change was made: Added a screen recording or screenshots to this PR
  • Updated the Unkey Docs if changes were necessary