fix(l1): snap sync pivot update crashes due to peer selection bugs #6474

@ElFantasma

Description

Motivation

Snap sync fails ~20% of the time on mainnet and deterministically on small networks (hoodi) due to bugs in the pivot-update peer selection path. The failure mode is always the same: update_pivot exhausts its 15-failure budget by repeatedly asking the same 1–2 peers, then exits the process as "irrecoverable." The crash also leaves the DB in an inconsistent state, requiring a full removedb and resync from scratch.

Three independent bugs compound to produce this:

Bug A — weight_peer systematically selects the wrong peer

The weight function used by get_best_peer is:

fn weight_peer(&self, score: &i64, requests: &i64) -> i64 {
    score * SCORE_WEIGHT - requests * REQUESTS_WEIGHT  // = score - inflight_requests
}

During snap sync, healthy peers accumulate 40–50 inflight snap requests each, tanking their weight to ~3. An idle peer with 0 requests (e.g., an eth/70-only erigon node that doesn't support snap) gets weight ~47 and is selected every time — even though it can't answer the query.

Confirmed on mainnet run run_20260414_011127 (53 peers). The stuck peer 0x32ea…14be was erigon/v3.5.0-dev with eth/70 only (no snap, no eth/68/69). Score 47, requests 0, weight 47. Eleven healthy Geth peers had score 50 but requests ~47 each → weight 3. get_best_peer returned the idle erigon peer every time.

This bug gets WORSE on larger networks — more peers means more total inflight requests per peer, making the weight gap between idle and busy peers even larger.

The core issue is that inflight snap data requests and control-plane requests (pivot update, header resolution) share the same weight function. Snap inflight count should not penalize a peer's eligibility for control-plane requests, but at the same time we need a mechanism to distribute control-plane requests across eligible peers rather than always picking the same one.

Bug B — Deterministic peer selection never rotates

get_best_peer uses .max_by_key(weight_peer) which deterministically returns the same peer when weights are stable. Its "skip same peer" logic only alternates within the top-N. If the top-N peers don't have the requested block, every other peer in the table is never tried.

Confirmed on hoodi run run_20260411_033943 (13 peers). Only 2 distinct peers were tried across 9 real attempts. At least 6 other peers had BlockRangeUpdate.range_to >= 2593928 (the requested block) — visible in the logs during the retry window — and were never asked.

Two independent surfaces:

  • get_peer_connections returns peers in HashMap iteration order (stable). request_block_headers does take(MAX_PEERS_TO_ASK) (5) and retries 3× → same 5 peers every retry.
  • get_best_peer deterministically returns the top-scored peer. On small networks the alternation window is pathologically narrow.
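A minimal standalone sketch (peer names and weights are illustrative, not taken from the codebase) of why selection never rotates: .max_by_key is a pure function of the weights, so while weights are stable it returns the same peer on every call, no matter how many times that peer has already failed:

fn main() {
    // Hypothetical peer table: (name, weight). The idle peer has the top
    // weight because it carries no inflight snap requests (Bug A).
    let peers = [("erigon-idle", 47i64), ("geth-1", 3), ("geth-2", 3)];
    for attempt in 1..=3 {
        // Deterministic: same input, same output, every retry.
        let best = peers.iter().max_by_key(|(_, weight)| *weight).unwrap();
        println!("attempt {attempt}: selected {}", best.0);
    }
    // Prints "erigon-idle" three times; geth-1 and geth-2 are never tried.
}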

Bug C — Available data not used for peer filtering

Peers send BlockRangeUpdate messages advertising their chain tip. This data is received and stored but never used for peer selection. On hoodi, the stuck peer 0xeedf…df38 sent a BlockRangeUpdate with range_to=2593886, i.e. 42 blocks behind the requested pivot 2593928. We kept asking it anyway.

Similarly, on mainnet, the erigon peer likely hadn't synced to the requested pivot block. Filtering by range_to >= requested_block would have eliminated both failures immediately.

Compounding factors

"Waiting for different peer" counts as failures

Between real request attempts, the retry loop increments total_failures for passive wait cycles ("peer X failed 3 times, waiting for a different peer"). On hoodi, 6 of 15 failures were passive waits. On mainnet, 11 of 15. This burns 40–73% of the failure budget without making any actual requests.

Post-pivot header fetch has the same bugs

After update_pivot succeeds, snap_sync.rs calls request_block_headers(block_number + 1, pivot.hash()), which uses ask_peer_head_number → same deterministic top-5 peers. Peers respond with EmptyResponseFromPeer (they don't have the just-updated pivot yet), so the call returns NoBlockHeaders → classified as irrecoverable → process::exit(2). This was the failure mode in mainnet run #7.

DB corruption on crash

process::exit(2) skips cleanup. On restart: Error: Unknown state found in DB. Please run 'ethrex removedb'. Every sync crash requires a full resync from scratch.

Evidence

Hoodi failure (run_20260411_033943)

13 peers total. Failed requesting pivot block 2593928.

Time                Peer                                    Score change   total_failures
03:50:51–03:50:55   0xeedf…df38 (Nimbus)                    0→−3           0→2
03:50:59–03:52:11   (80s waiting, no real requests)                        2→8
03:52:11–03:52:14   0x501b…7bc8 (Nethermind)                0→−3           8→10
03:52:17            Nethermind disconnects (TooManyPeers)
03:52:18–03:52:21   0xeedf…df38 (back to same)              −3→−6          11→13
03:52:25            process::exit(2)

Only 2 distinct peers tried. 6+ peers with the requested block were never asked.

Additional context: 0 instances of our-side peer rejection, 1,818 instances of remote "Too Many Peers" disconnections. On hoodi we can't grow the peer pool beyond ~13 — most nodes are already full. This makes peer-selection breadth critical: the pool we have is all we'll get.

Mainnet failure (run_20260414_011127)

53 peers. Failed requesting pivot block 24874945. The stuck peer 0x32ea…14be (erigon, eth/70 only) had score 47, requests 0, weight 47. Eleven healthy Geth peers had score 50, requests ~47, weight 3.

3 real requests to erigon, 11 "waiting for different peer" cycles (~180s wasted). Two other peers had range_to=24874945 at 02:09 — 5 minutes before the crash.

After crash: Error: Unknown state found in DB. Please run 'ethrex removedb'.

Proposed fixes (priority order)

1. Don't let snap inflight count dominate control-plane peer selection

The weight function score - inflight_requests conflates data-plane load (snap fetches) with control-plane eligibility (pivot updates, header lookups). During snap sync, healthy peers carry 40–50 inflight requests, making them look worse than idle/incapable peers.

Options:

  • Separate weight for control-plane requests that ignores (or caps) the snap inflight penalty. E.g., weight = score - min(requests, 5) so weight never drops below score - 5.
  • Use get_random_peer for pivot updates — pivot updates are rare (~1 per sync cycle); random selection among eligibles avoids the determinism bug entirely.
  • Weight by request success rate instead of raw inflight count — a peer handling 50 requests and succeeding is better than one handling 0.

Note: while snap inflight shouldn't penalize control-plane selection, we still need to distribute control-plane load across eligible peers rather than always converging on one. Any solution should ensure that among equally-eligible peers, requests are spread (randomization, round-robin, or similar).
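A minimal sketch of the capped-penalty option combined with randomized tie-breaking (the field names, the cap of 5, and the jitter mechanism are all illustrative assumptions, not the existing API):

use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

// Hypothetical peer record; field names are illustrative.
struct Peer {
    id: u64,
    score: i64,
    inflight_requests: i64,
}

// Control-plane weight: cap the inflight penalty so heavy snap traffic
// cannot make a healthy peer look worse than an idle, incapable one.
fn control_plane_weight(peer: &Peer) -> i64 {
    peer.score - peer.inflight_requests.min(5)
}

fn select_control_plane_peer(peers: &[Peer]) -> Option<&Peer> {
    let jitter = RandomState::new();
    peers.iter().max_by_key(|p| {
        let mut h = jitter.build_hasher();
        p.id.hash(&mut h);
        // Primary key: capped weight. Secondary key: a per-process random
        // hash of the peer id, so ties break differently across cycles.
        (control_plane_weight(p), h.finish())
    })
}

Randomizing only the tie-break keeps the scoring signal intact while ensuring equally-healthy peers split the control-plane load instead of one peer being deterministically re-selected.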

2. Filter peers by BlockRangeUpdate.range_to

Before selecting a peer for update_pivot or header requests, filter out peers whose last BlockRangeUpdate.range_to < requested_block_number. The data is already received and stored — just not used in selection.
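A sketch of the filter, assuming a hypothetical field (last_advertised_range_to) holding the already-stored BlockRangeUpdate data; the point is only that selection consults it:

// Hypothetical shape: the peer table already stores the last advertised
// BlockRangeUpdate per peer; it just isn't consulted during selection.
struct PeerInfo {
    id: u64,
    // None if the peer never sent a BlockRangeUpdate.
    last_advertised_range_to: Option<u64>,
}

// Keep only peers that advertised a chain tip at or past the block we are
// about to request. Peers with no advertisement are kept, since the absence
// of an advertisement is not evidence they lack the block.
fn eligible_for_block(peers: &[PeerInfo], requested_block: u64) -> Vec<&PeerInfo> {
    peers
        .iter()
        .filter(|p| {
            p.last_advertised_range_to
                .map_or(true, |tip| tip >= requested_block)
        })
        .collect()
}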

3. Broaden peer rotation

get_best_peer must not return the same peer after repeated failures. Options (combinable; a sketch follows the list):

  • Track which peers were already tried this cycle and exclude them
  • After N consecutive failures on a peer, blacklist it for this call
  • On small networks, ensure every eligible peer is tried before declaring failure
  • Add jitter/randomization to break deterministic .max_by_key ties
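A sketch combining the first and third bullets (tried-set exclusion plus exhaustive coverage); the peer representation and weight are placeholders, not the real types:

use std::collections::HashSet;

// Hypothetical rotation wrapper: exclude peers already tried this cycle.
struct PeerRotation {
    tried: HashSet<u64>,
}

impl PeerRotation {
    fn new() -> Self {
        Self { tried: HashSet::new() }
    }

    // Pick the best not-yet-tried peer from (id, weight) pairs. On small
    // networks this guarantees every eligible peer is attempted before the
    // cycle can declare failure.
    fn next_peer<'a>(&mut self, peers: &'a [(u64, i64)]) -> Option<&'a (u64, i64)> {
        let pick = peers
            .iter()
            .filter(|(id, _)| !self.tried.contains(id))
            .max_by_key(|(_, weight)| *weight)?;
        self.tried.insert(pick.0);
        Some(pick)
    }
}

Once next_peer returns None, every eligible peer has been tried this cycle, which is the earliest point at which declaring failure is justified.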

4. Don't count passive waits as failures

Only increment total_failures on actual peer requests, not on "waiting for a different peer" cycles. The current behavior burns 40–73% of the failure budget without doing useful work.
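A sketch of the accounting split, with hypothetical names; only real request failures consume the budget that can escalate to "irrecoverable":

enum Attempt {
    RealRequestFailed,
    WaitingForDifferentPeer,
}

struct FailureBudget {
    real_failures: u32,
    passive_waits: u32,
}

impl FailureBudget {
    fn record(&mut self, attempt: Attempt) {
        match attempt {
            Attempt::RealRequestFailed => self.real_failures += 1,
            // Waiting is not evidence the network can't serve us; track it
            // separately instead of burning the failure budget.
            Attempt::WaitingForDifferentPeer => self.passive_waits += 1,
        }
    }

    fn exhausted(&self, max_total_failures: u32) -> bool {
        self.real_failures >= max_total_failures
    }
}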

5. Fix post-pivot header fetch

The header fetch after update_pivot succeeds uses the same deterministic peer selection. Wrap in a retry loop with peer rotation (overlaps with #3). For the case where the new pivot hasn't propagated to most peers yet: consider choosing a slightly older pivot (e.g., latest - 64 instead of latest).
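A one-function sketch of the older-pivot option (the 64-block offset is the value suggested above, not a tuned constant):

// Stay 64 blocks behind head so most peers have already imported the
// pivot by the time we request its headers.
fn choose_pivot(latest_block: u64) -> u64 {
    latest_block.saturating_sub(64)
}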

6. Reclassify pivot-update failures as recoverable

update_pivot returning PeerHandlerError::BlockHeaders maps to SyncError::PeerHandler → classified as irrecoverable around sync.rs:258-259 → process::exit(2). This should be retryable, not fatal. With fixes #1–#4 landing, the failure becomes much rarer, but when it does happen the sync should loop back and try again after a cooldown.
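A control-flow sketch of the recoverable classification (the outcome type and the 30s cooldown are illustrative):

use std::time::Duration;

// Hypothetical stand-in for the sync result enum.
enum SyncOutcome {
    Done,
    PivotUpdateFailed,
}

// Treat pivot-update failure as retryable with a cooldown rather than
// exiting the process; a fresh cycle picks a fresh pivot and peer set.
fn sync_loop(mut try_sync: impl FnMut() -> SyncOutcome) {
    loop {
        match try_sync() {
            SyncOutcome::Done => break,
            SyncOutcome::PivotUpdateFailed => {
                std::thread::sleep(Duration::from_secs(30));
            }
        }
    }
}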

7. Raise or rethink MAX_TOTAL_FAILURES = 15

Less critical once #1–#4 land. The cap is low enough that 15 failures can be exhausted in 2–3 minutes. Options: raise it to 50+, make it per-peer-cycle instead of global, or remove it entirely if the irrecoverable classification is fixed (#6).

8. Graceful shutdown / DB resilience on sync failure

process::exit(2) leaves the DB in an inconsistent state. Either:

  • Clean up temp sync state before exiting (flush/remove incomplete data)
  • Make startup resilient to partial sync state (detect + resume, or detect + clean automatically)

This is independent of the peer selection bugs but compounds the impact — every crash currently requires a full resync.
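A sketch of the first option, with a stand-in Db type (the real handle and methods would differ): flag and flush before any deliberate exit, because process::exit skips destructors.

// Minimal stand-in for the real database handle (hypothetical methods).
struct Db;

impl Db {
    fn mark_sync_incomplete(&mut self) { /* persist a "sync dirty" flag */ }
    fn flush(&mut self) { /* fsync pending writes */ }
}

// Any deliberate exit flushes and flags state first. On the next start,
// the flag lets the node detect partial sync state and resume or clean it
// automatically instead of demanding a manual `ethrex removedb`.
fn fatal_sync_exit(db: &mut Db, code: i32) -> ! {
    db.mark_sync_incomplete();
    db.flush();
    std::process::exit(code);
}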

Related work

Commits on fullsync-acceleration / fullsync-improvement branches address some of these:

  • 880244afe — weighted peer selection, record failure/success on header download, more fetch attempts, faster timeout
  • efaa344d4 — reset failure counter on success (fail only on consecutive failures, not total)

These should be evaluated for cherry-picking or porting.

Debugging tools

PR #6470 (feat(l1): add snap sync observability endpoints and tooling) adds tooling specifically designed to debug this class of issue:

  • admin_syncStatus RPC endpoint — live sync phase, pivot block, progress metrics
  • admin_peerScores RPC endpoint — per-peer scores, inflight request counts, capabilities
  • tooling/sync/peer_top.py — live TUI showing peer scores, request distribution, and selection patterns in real time
  • Grafana dashboard panels for sync progress, peer scoring distribution, and request rates
  • Header-download diagnostics logging in snap_sync.rs
  • Degradation detection with automatic TRACE escalation

These tools were used to produce the forensic analysis in this issue and should be merged first to enable verification of fixes.

Why it fails ~20% of the time (not always)

The MAX_TOTAL_FAILURES = 15 irrecoverable classification was likely introduced as a regression by #6394 (fix(l1): p2p sync stall fixes and discovery hardening, merged 2026-03-25). Before that PR, update_pivot used an infinite retry loop with 1s backoff. #6394 changed it to a finite failure budget with exponential backoff — the intent was to prevent infinite stalls, but the low cap combined with Bugs B and C turned transient failures into fatal crashes.

Bug B (deterministic peer selection) and Bug C (unused BlockRangeUpdate data) predate #6394, but were masked because the infinite retry loop eventually found a working peer by exhaustion. The regression in #6394 made the retry budget finite without fixing the underlying peer selection issues, turning a slow-but-working path into a crash.

Runs where all phases complete fast enough never need a pivot update → never hit the buggy path. Slower runs (larger state, slower peers, network congestion) trigger pivot staleness → update_pivot → the bugs.
