Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions keeper/.env.example
Original file line number Diff line number Diff line change
@@ -1,7 +1,15 @@
# Soroban Network Configuration
SOROBAN_RPC_URL=https://rpc-futurenet.stellar.org
# Optional comma-separated regional RPC endpoints for automated failover.
# SOROBAN_RPC_URLS=https://rpc-us.example.com,https://rpc-eu.example.com
NETWORK_PASSPHRASE=Test SDF Future Network ; October 2022

# Multi-region failover tuning
RPC_FAILOVER_ENABLED=true
RPC_FAILOVER_FAILURE_THRESHOLD=3
RPC_FAILOVER_COOLDOWN_MS=30000
RPC_FAILOVER_HEALTH_CHECK_INTERVAL_MS=15000

# Keeper Account
KEEPER_SECRET=S...

Expand Down
18 changes: 9 additions & 9 deletions keeper/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -211,15 +211,9 @@ RESOLVER_FAILURE_MODE=skip
- **`P2P_ENABLED` / `P2P_SHARED_SECRET`**: Enables signed peer discovery and load-aware ownership. See [P2P Keeper Discovery](./docs/p2p-keeper-discovery.md).
- **`P2P_PUBLIC_URL` / `P2P_BOOTSTRAP_PEERS`**: Advertised peer URL and initial peer list used to join the keeper mesh.
- **`DRIFT_WARNING_SECONDS` / `DRIFT_CRITICAL_SECONDS`**: Thresholds for recurring execution drift classification.
- **`RESOLVER_FUNCTIONS_CONFIG`**: Optional JSON file mapping task resolver IDs to sandboxed JS/WASM functions.
- **`RESOLVER_DEFAULT_TIMEOUT_MS`**: Default per-invocation resolver timeout.
- **`RESOLVER_FAILURE_MODE`**: `skip` fails closed on resolver errors; `allow` fails open for controlled migrations.

## Serverless Resolvers

The keeper can evaluate custom JavaScript or WASM resolver logic before enqueueing a due task. Resolvers run after interval and gas checks, inside a bounded runtime with static capability checks, payload size limits, and per-call timeouts. Tasks without resolver IDs are unchanged.

See [Serverless Resolver Runtime](./docs/serverless-resolvers.md) for configuration, authoring examples, and the security model.
- **`SOROBAN_RPC_URLS` / `RPC_FAILOVER_ENABLED`**: Optional multi-region RPC list and failover toggle (automatically enabled when multiple URLs are configured).
- **`RPC_FAILOVER_FAILURE_THRESHOLD` / `RPC_FAILOVER_COOLDOWN_MS`**: Endpoint quarantine policy after repeated failures.
- **`RPC_FAILOVER_HEALTH_CHECK_INTERVAL_MS`**: Background health-probe interval used for endpoint recovery and rebalancing.

## RPC Load Balancer Configuration

Expand Down Expand Up @@ -271,6 +265,12 @@ The optional P2P layer lets keepers discover each other, advertise load, and spl

See [P2P Keeper Discovery](./docs/p2p-keeper-discovery.md) for setup, security review notes, health fields, and failure behavior.

## Disaster Recovery and Failover

The keeper supports automated multi-region RPC failover for disaster recovery. Configure a primary endpoint in `SOROBAN_RPC_URL` and additional regions in `SOROBAN_RPC_URLS`.

See [Disaster Recovery and Failover Guide](./docs/disaster-recovery-failover.md) for architecture, metrics, and operational runbooks.

### Dead-Letter Queue Configuration

The keeper includes a Dead-Letter Queue (DLQ) for handling repeatedly failing tasks:
Expand Down
108 changes: 108 additions & 0 deletions keeper/__tests__/disasterRecovery.test.js
Original file line number Diff line number Diff line change
@@ -0,0 +1,108 @@
const { MultiRegionRPCClient } = require('../src/disasterRecovery');

describe('MultiRegionRPCClient', () => {
function createFakeServerFactory(handlersByUrl) {
return (url) => {
const handlers = handlersByUrl[url] || {};
return {
serverURL: { toString: () => url },
getNetwork: handlers.getNetwork || (async () => ({ passphrase: 'test' })),
getHealth: handlers.getHealth || (async () => ({ status: 'healthy' })),
getLatestLedger: handlers.getLatestLedger || (async () => ({ sequence: 1 })),
};
};
}

test('uses active endpoint when healthy', async () => {
const client = new MultiRegionRPCClient(['https://a.example', 'https://b.example'], {
serverFactory: createFakeServerFactory({
'https://a.example': {
getNetwork: async () => ({ passphrase: 'A' }),
},
}),
});

const server = client.getServerFacade();
const result = await server.getNetwork();

expect(result.passphrase).toBe('A');
expect(client.getStateSnapshot().activeRegion).toContain('a.example');
});

test('fails over to secondary endpoint after primary failure', async () => {
const metrics = { increment: jest.fn(), updateFailoverState: jest.fn() };
const client = new MultiRegionRPCClient(['https://a.example', 'https://b.example'], {
metrics,
failureThreshold: 1,
serverFactory: createFakeServerFactory({
'https://a.example': {
getNetwork: async () => {
throw new Error('primary down');
},
},
'https://b.example': {
getNetwork: async () => ({ passphrase: 'B' }),
},
}),
});

const server = client.getServerFacade();
const result = await server.getNetwork();

expect(result.passphrase).toBe('B');
expect(client.getStateSnapshot().activeRegion).toContain('b.example');
expect(metrics.increment).toHaveBeenCalledWith('failoverEventsTotal', 1);
expect(metrics.increment).toHaveBeenCalledWith('failoverSwitchesTotal', 1);
});

test('throws structured error if all endpoints fail', async () => {
const client = new MultiRegionRPCClient(['https://a.example', 'https://b.example'], {
failureThreshold: 1,
serverFactory: createFakeServerFactory({
'https://a.example': {
getNetwork: async () => {
throw new Error('a down');
},
},
'https://b.example': {
getNetwork: async () => {
throw new Error('b down');
},
},
}),
});

const server = client.getServerFacade();

await expect(server.getNetwork()).rejects.toMatchObject({
code: 'RPC_MULTI_REGION_FAILOVER_EXHAUSTED',
});
});

test('health checks recover endpoint after cooldown', async () => {
let failHealth = true;
const client = new MultiRegionRPCClient(['https://a.example', 'https://b.example'], {
failureThreshold: 1,
cooldownMs: 1,
serverFactory: createFakeServerFactory({
'https://a.example': {
getHealth: async () => {
if (failHealth) {
throw new Error('unhealthy');
}
return { status: 'healthy' };
},
},
}),
});

await client.runHealthCheck();
expect(client.getStateSnapshot().endpoints[0].unavailable).toBe(true);

failHealth = false;
await new Promise((resolve) => setTimeout(resolve, 5));
await client.runHealthCheck();

expect(client.getStateSnapshot().endpoints[0].unavailable).toBe(false);
});
});
107 changes: 15 additions & 92 deletions keeper/__tests__/metrics.test.js
Original file line number Diff line number Diff line change
Expand Up @@ -38,95 +38,18 @@ describe('Metrics', () => {
expect(typeof snapshot).toBe('object');
});

describe('SLO configuration', () => {
it('should set and get SLO thresholds', () => {
metrics.setSloThreshold('pollFreshness', 30000);
expect(metrics.getSloThreshold('pollFreshness')).toBe(30000);
metrics.setSloThreshold('executionTimeliness', 120000);
expect(metrics.getSloThreshold('executionTimeliness')).toBe(120000);
});

it('should have default SLO thresholds', () => {
expect(metrics.sloThresholds.pollFreshnessMs).toBe(60000);
});
});

describe('SLO counters', () => {
it('should increment poll freshness SLO success', () => {
metrics.increment('pollFreshnessSloSuccess');
expect(metrics.counters.pollFreshnessSloSuccess).toBe(1);
});

it('should increment poll freshness SLO failure', () => {
metrics.increment('pollFreshnessSloFailure');
expect(metrics.counters.pollFreshnessSloFailure).toBe(1);
});

it('should increment execution timeliness SLO success', () => {
metrics.increment('executionTimelinessSloSuccess');
expect(metrics.counters.executionTimelinessSloSuccess).toBe(1);
});

it('should increment execution timeliness SLO failure', () => {
metrics.increment('executionTimelinessSloFailure');
expect(metrics.counters.executionTimelinessSloFailure).toBe(1);
});

it('should increment retries exhausted', () => {
metrics.increment('retriesExhausted');
expect(metrics.counters.retriesExhausted).toBe(1);
});

it('should increment retry attempts with outcome', () => {
metrics.increment('retryAttemptsTotal', { outcome: 'success' });
expect(metrics.counters.retryAttemptsTotal.success).toBe(1);
metrics.increment('retryAttemptsTotal', { outcome: 'failure' });
expect(metrics.counters.retryAttemptsTotal.failure).toBe(1);
metrics.increment('retryAttemptsTotal', { outcome: 'duplicate' });
expect(metrics.counters.retryAttemptsTotal.duplicate).toBe(1);
});
});

describe('SLO gauges', () => {
it('should record poll freshness seconds', () => {
metrics.record('pollFreshnessSeconds', 5);
expect(metrics.gauges.pollFreshnessSeconds).toBe(5);
});

it('should record oldest task age', () => {
metrics.record('oldestTaskAgeSeconds', 120);
expect(metrics.gauges.oldestTaskAgeSeconds).toBe(120);
});

it('should record retry queue size', () => {
metrics.record('retryQueueSize', 3);
expect(metrics.gauges.retryQueueSize).toBe(3);
});

it('should record SLO rates', () => {
metrics.record('pollFreshnessSloRate', 0.95);
metrics.record('executionTimelinessSloRate', 0.87);
expect(metrics.gauges.pollFreshnessSloRate).toBe(0.95);
expect(metrics.gauges.executionTimelinessSloRate).toBe(0.87);
});
});

describe('SLO rate computation', () => {
it('should compute SLO rates from counters', () => {
// Initially no observations
expect(metrics.gauges.pollFreshnessSloRate).toBe(0);
metrics.increment('pollFreshnessSloSuccess');
metrics.increment('pollFreshnessSloSuccess');
metrics.increment('pollFreshnessSloFailure');
// 2 success / 3 total = 0.666...
expect(metrics.gauges.pollFreshnessSloRate).toBeCloseTo(2/3, 5);
});
});

describe('setPollIntervalMs', () => {
it('should update poll interval', () => {
metrics.setPollIntervalMs(45000);
expect(metrics.pollIntervalMs).toBe(45000);
});
});
});
it('should store failover state in snapshot', () => {
metrics.updateFailoverState({
activeIndex: 1,
activeRegion: 'us-west',
healthyEndpoints: 2,
totalEndpoints: 3,
endpoints: [{ index: 1, region: 'us-west', unavailable: false }],
});

const snapshot = metrics.snapshot();
expect(snapshot.failover.activeIndex).toBe(1);
expect(snapshot.failover.activeRegion).toBe('us-west');
expect(snapshot.failover.totalEndpoints).toBe(3);
});
});
76 changes: 76 additions & 0 deletions keeper/docs/disaster-recovery-failover.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,76 @@
# Automated Disaster Recovery and Multi-Region Failover

This document describes the keeper's automated disaster recovery path for Soroban RPC outages.

## Goals

- Keep task polling and execution available when a primary region fails.
- Detect endpoint degradation early and route traffic to healthy regions.
- Emit explicit operational signals for failover events and endpoint health.

## Architecture

The keeper now uses `MultiRegionRPCClient` as the RPC access layer.

- Endpoints are configured as an ordered region list.
- One endpoint is marked active at any point in time.
- Calls are attempted on the active endpoint first.
- On failure, calls automatically retry across the remaining healthy endpoints.
- Repeated failures mark an endpoint unavailable for a cooldown window.
- Background health checks periodically recover endpoints and may rebalance active routing.

## Configuration

Required:

- `SOROBAN_RPC_URL`: primary RPC URL (backward-compatible default)

Optional failover controls:

- `SOROBAN_RPC_URLS`: comma-separated list of additional RPC URLs (multi-region)
- `RPC_FAILOVER_ENABLED`: `true|false` (default: enabled when more than one URL exists)
- `RPC_FAILOVER_FAILURE_THRESHOLD`: consecutive failures before endpoint quarantine (default: `3`)
- `RPC_FAILOVER_COOLDOWN_MS`: endpoint cooldown window (default: `30000`)
- `RPC_FAILOVER_HEALTH_CHECK_INTERVAL_MS`: background probe interval (default: `15000`)

## Observability

The metrics and health endpoints include failover state.

JSON endpoints (`/health`, `/metrics`):

- Active region and endpoint index
- Healthy endpoint count versus total
- Per-endpoint status and failure metadata

Prometheus metrics:

- `keeper_rpc_failover_events_total`
- `keeper_rpc_failover_switches_total`
- `keeper_rpc_failover_active_endpoint_index`
- `keeper_rpc_failover_healthy_endpoints`
- `keeper_rpc_failover_total_endpoints`

## Failure and Recovery Flow

1. Active region fails an RPC call.
2. Failure counters increase and endpoint score decreases.
3. If threshold is exceeded, endpoint becomes unavailable for cooldown.
4. Request is retried against alternate regions in-priority order.
5. Successful alternate response updates active endpoint selection.
6. Background health checks recover previously unavailable regions.

## Security Notes

- Keep all RPC URLs on trusted infrastructure and private networking where possible.
- Use TLS endpoints only for production traffic.
- Rotate keeper secrets independently from region failover operations.

## Testing

Unit tests for failover behavior are in `keeper/__tests__/disasterRecovery.test.js` and cover:

- Healthy primary path
- Automatic cross-region failover
- Exhausted failover error path
- Endpoint recovery via health checks
Loading
Loading