feat(keeper): implement automated disaster recovery and multi-region failover #495

Merged: ayomideadeniran merged 3 commits into SoroLabs:main from d3vobed:feat/issue-452-dr-failover on Jun 4, 2026
@d3vobed commented May 30, 2026

Summary

Implements an automated disaster recovery and failover system for the keeper service with multi-region RPC endpoint support.

What was added

  • New MultiRegionRPCClient failover layer:
    • active endpoint routing
    • automatic endpoint fallback on RPC failure
    • endpoint quarantine with configurable cooldown
    • background health checks and endpoint recovery
  • Keeper startup integration in keeper/index.js:
    • uses multi-region failover client as the primary RPC abstraction
    • exposes live failover state to metrics/health
    • lifecycle shutdown handling for failover manager
  • Configuration additions in keeper/src/config.js:
    • SOROBAN_RPC_URLS
    • RPC_FAILOVER_ENABLED
    • RPC_FAILOVER_FAILURE_THRESHOLD
    • RPC_FAILOVER_COOLDOWN_MS
    • RPC_FAILOVER_HEALTH_CHECK_INTERVAL_MS
  • Observability enhancements in keeper/src/metrics.js:
    • failover counters and gauges
    • failover state in /health and /metrics
    • Prometheus failover metrics
  • Documentation:
    • keeper/docs/disaster-recovery-failover.md
    • updates in keeper/README.md and keeper/.env.example
  • Tests:
    • keeper/__tests__/disasterRecovery.test.js
    • extended keeper/__tests__/metrics.test.js
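
For readers without the diff open, the failover behaviour listed above (active endpoint routing, fallback on RPC failure, quarantine with a cooldown) can be sketched roughly as follows. This is a minimal illustration and not the code merged in this PR: the class name and environment variable names come from the description, but the constructor options, the method names (`available`, `activeUrl`, `call`, `state`), the default values, and the comma-separated format of SOROBAN_RPC_URLS are all assumptions. The sketch also releases a quarantined endpoint when its cooldown expires, as a stand-in for the background health checks the PR describes.

```javascript
// Hypothetical sketch of a multi-region failover layer; names other than
// MultiRegionRPCClient and the env variables are illustrative assumptions.
class MultiRegionRPCClient {
  constructor({ urls, failureThreshold = 3, cooldownMs = 30000, now = Date.now }) {
    if (!Array.isArray(urls) || urls.length === 0) {
      throw new Error('at least one RPC URL is required');
    }
    // Endpoints keep their configured order, which doubles as priority.
    this.endpoints = urls.map((url) => ({ url, failures: 0, quarantinedUntil: 0 }));
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now; // injectable clock, which keeps tests deterministic
  }

  // Endpoints whose quarantine (if any) has expired.
  available() {
    const t = this.now();
    return this.endpoints.filter((e) => e.quarantinedUntil <= t);
  }

  // The endpoint that the next call would try first, or null if none is usable.
  activeUrl() {
    const [first] = this.available();
    return first ? first.url : null;
  }

  // Run fn(url) against each available endpoint until one succeeds.
  async call(fn) {
    let lastError = new Error('all RPC endpoints are quarantined');
    for (const endpoint of this.available()) {
      try {
        const result = await fn(endpoint.url);
        endpoint.failures = 0;
        return result;
      } catch (err) {
        lastError = err;
        endpoint.failures += 1;
        if (endpoint.failures >= this.failureThreshold) {
          // Quarantine the endpoint for the cooldown period.
          endpoint.quarantinedUntil = this.now() + this.cooldownMs;
          endpoint.failures = 0;
        }
      }
    }
    throw lastError;
  }

  // Snapshot that a /health handler could expose.
  state() {
    const t = this.now();
    return {
      active: this.activeUrl(),
      quarantined: this.endpoints.filter((e) => e.quarantinedUntil > t).map((e) => e.url),
    };
  }
}

// Assumed mapping from the env variables named above to client options.
function configFromEnv(env) {
  return {
    urls: (env.SOROBAN_RPC_URLS || '').split(',').map((u) => u.trim()).filter(Boolean),
    failureThreshold: Number(env.RPC_FAILOVER_FAILURE_THRESHOLD || 3),
    cooldownMs: Number(env.RPC_FAILOVER_COOLDOWN_MS || 30000),
  };
}
```

Under these assumptions, a caller would build the client with `new MultiRegionRPCClient(configFromEnv(process.env))` and wrap each RPC request in `client.call((url) => ...)`, so the keeper never needs to know which region served the request.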

Acceptance criteria mapping

  • Feature implementation: ✅
  • Error tracking/fallback behavior: ✅
  • Infrastructure integration: ✅
  • Documentation: ✅
  • Unit coverage for failover paths: ✅

Validation notes

  • Static diagnostics reported no errors in the modified files.
  • Runtime tests could not be executed in this container because Node/npm are unavailable and package installation is not permitted in this environment.

Closes #452

drips-wave Bot commented May 30, 2026

@d3vobed Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits.

You can now apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀


@d3vobed force-pushed the feat/issue-452-dr-failover branch from 6ec3471 to 680ccd8 on May 31, 2026 21:59
@ayomideadeniran (Contributor) commented

PR under review. If I find any incorrect implementation, I will drop a review.

@ayomideadeniran marked this pull request as ready for review June 4, 2026 14:07
@ayomideadeniran merged commit 7946b9c into SoroLabs:main Jun 4, 2026
1 check failed

Development

Successfully merging this pull request may close these issues.

[Backend] Implement Automated Disaster Recovery and Failover System
