fix(l1): snap sync pivot update crashes due to peer selection bugs #6474

@ElFantasma

Description

Motivation

Snap sync fails ~20% of the time on mainnet and deterministically on small networks (hoodi) due to bugs in the pivot-update peer selection path. The failure mode is always the same: update_pivot exhausts its 15-failure budget by repeatedly asking the same 1–2 peers, then exits the process as "irrecoverable." The crash also leaves the DB in an inconsistent state, requiring a full removedb and resync from scratch.

Three independent bugs compound to produce this:

Bug A — weight_peer systematically selects the wrong peer

The weight function used by get_best_peer is:

fn weight_peer(&self, score: &i64, requests: &i64) -> i64 {
    score * SCORE_WEIGHT - requests * REQUESTS_WEIGHT  // = score - inflight_requests
}

During snap sync, healthy peers accumulate 40–50 inflight snap requests each, tanking their weight to ~3. An idle peer with 0 requests (e.g., an eth/70-only erigon node that doesn't support snap) gets weight ~47 and is selected every time — even though it can't answer the query.

Confirmed on mainnet run run_20260414_011127 (53 peers). The stuck peer 0x32ea…14be was erigon/v3.5.0-dev with eth/70 only (no snap, no eth/68/69). Score 47, requests 0, weight 47. Eleven healthy Geth peers had score 50 but requests ~47 each → weight 3. get_best_peer returned the idle erigon peer every time.

This bug gets WORSE on larger networks — more peers means more total inflight requests per peer, making the weight gap between idle and busy peers even larger.

The core issue is that inflight snap data requests and control-plane requests (pivot update, header resolution) share the same weight function. Snap inflight count should not penalize a peer's eligibility for control-plane requests, but at the same time we need a mechanism to distribute control-plane requests across eligible peers rather than always picking the same one.

Bug B — Deterministic peer selection never rotates

get_best_peer uses .max_by_key(weight_peer) which deterministically returns the same peer when weights are stable. Its "skip same peer" logic only alternates within the top-N. If the top-N peers don't have the requested block, every other peer in the table is never tried.

Confirmed on hoodi run run_20260411_033943 (13 peers). Only 2 distinct peers were tried across 9 real attempts. At least 6 other peers had BlockRangeUpdate.range_to >= 2593928 (the requested block) — visible in the logs during the retry window — and were never asked.

Two independent surfaces:

  • get_peer_connections returns peers in HashMap iteration order (stable). request_block_headers does take(MAX_PEERS_TO_ASK) (5) and retries 3× → same 5 peers every retry.
  • get_best_peer deterministically returns the top-scored peer. On small networks the alternation window is pathologically narrow.
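A minimal standalone sketch (peer names and weights are illustrative, not taken from the codebase) of why selection never rotates: .max_by_key is a pure function of the weights, so while weights are stable it returns the same peer on every call, no matter how many times that peer has already failed:

fn main() {
    // Hypothetical peer table: (name, weight). The idle peer has the top
    // weight because it carries no inflight snap requests (Bug A).
    let peers = [("erigon-idle", 47i64), ("geth-1", 3), ("geth-2", 3)];
    for attempt in 1..=3 {
        // Deterministic: same input, same output, every retry.
        let best = peers.iter().max_by_key(|(_, weight)| *weight).unwrap();
        println!("attempt {attempt}: selected {}", best.0);
    }
    // Prints "erigon-idle" three times; geth-1 and geth-2 are never tried.
}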

Bug C — Available data not used for peer filtering

Peers send BlockRangeUpdate messages advertising their chain tip. This data is received and stored but never used for peer selection. On hoodi, the stuck peer 0xeedf…df38 sent a BlockRangeUpdate with range_to=2593886, i.e. 42 blocks behind the requested pivot 2593928. We kept asking it anyway.

Similarly, on mainnet, the erigon peer likely hadn't synced to the requested pivot block. Filtering by range_to >= requested_block would have eliminated both failures immediately.

Compounding factors

"Waiting for different peer" counts as failures

Between real request attempts, the retry loop increments total_failures for passive wait cycles ("peer X failed 3 times, waiting for a different peer"). On hoodi, 6 of 15 failures were passive waits. On mainnet, 11 of 15. This burns 40–73% of the failure budget without making any actual requests.

Post-pivot header fetch has the same bugs

After update_pivot succeeds, snap_sync.rs calls request_block_headers(block_number + 1, pivot.hash()), which uses ask_peer_head_number → same deterministic top-5 peers. Peers respond with EmptyResponseFromPeer (they don't have the just-updated pivot yet), so the call returns NoBlockHeaders → classified as irrecoverable → process::exit(2). This was the failure mode in mainnet run #7.

DB corruption on crash

process::exit(2) skips cleanup. On restart: Error: Unknown state found in DB. Please run 'ethrex removedb'. Every sync crash requires a full resync from scratch.

Evidence

Hoodi failure (run_20260411_033943)

13 peers total. Failed requesting pivot block 2593928.

Time                Peer                                    Score change   total_failures
03:50:51–03:50:55   0xeedf…df38 (Nimbus)                    0→−3           0→2
03:50:59–03:52:11   (80s waiting, no real requests)                        2→8
03:52:11–03:52:14   0x501b…7bc8 (Nethermind)                0→−3           8→10
03:52:17            Nethermind disconnects (TooManyPeers)
03:52:18–03:52:21   0xeedf…df38 (back to same)              −3→−6          11→13
03:52:25            process::exit(2)

Only 2 distinct peers tried. 6+ peers with the requested block were never asked.

Additional context: 0 instances of our-side peer rejection, 1,818 instances of remote "Too Many Peers" disconnections. On hoodi we can't grow the peer pool beyond ~13 — most nodes are already full. This makes peer-selection breadth critical: the pool we have is all we'll get.

Mainnet failure (run_20260414_011127)

53 peers. Failed requesting pivot block 24874945. The stuck peer 0x32ea…14be (erigon, eth/70 only) had score 47, requests 0, weight 47. Eleven healthy Geth peers had score 50, requests ~47, weight 3.

3 real requests to erigon, 11 "waiting for different peer" cycles (~180s wasted). Two other peers had range_to=24874945 at 02:09 — 5 minutes before the crash.

After crash: Error: Unknown state found in DB. Please run 'ethrex removedb'.

Proposed fixes (priority order)

1. Don't let snap inflight count dominate control-plane peer selection

The weight function score - inflight_requests conflates data-plane load (snap fetches) with control-plane eligibility (pivot updates, header lookups). During snap sync, healthy peers carry 40–50 inflight requests, making them look worse than idle/incapable peers.

Options:

  • Separate weight for control-plane requests that ignores (or caps) the snap inflight penalty. E.g., weight = score - min(requests, 5) so weight never drops below score - 5.
  • Use get_random_peer for pivot updates — pivot updates are rare (~1 per sync cycle); random selection among eligibles avoids the determinism bug entirely.
  • Weight by request success rate instead of raw inflight count — a peer handling 50 requests and succeeding is better than one handling 0.

Note: while snap inflight shouldn't penalize control-plane selection, we still need to distribute control-plane load across eligible peers rather than always converging on one. Any solution should ensure that among equally-eligible peers, requests are spread (randomization, round-robin, or similar).
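A minimal sketch of the capped-penalty option combined with randomized tie-breaking (the field names, the cap of 5, and the jitter mechanism are all illustrative assumptions, not the existing API):

use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

// Hypothetical peer record; field names are illustrative.
struct Peer {
    id: u64,
    score: i64,
    inflight_requests: i64,
}

// Control-plane weight: cap the inflight penalty so heavy snap traffic
// cannot make a healthy peer look worse than an idle, incapable one.
fn control_plane_weight(peer: &Peer) -> i64 {
    peer.score - peer.inflight_requests.min(5)
}

fn select_control_plane_peer(peers: &[Peer]) -> Option<&Peer> {
    let jitter = RandomState::new();
    peers.iter().max_by_key(|p| {
        let mut h = jitter.build_hasher();
        p.id.hash(&mut h);
        // Primary key: capped weight. Secondary key: a per-process random
        // hash of the peer id, so ties break differently across cycles.
        (control_plane_weight(p), h.finish())
    })
}

Randomizing only the tie-break keeps the scoring signal intact while ensuring equally-healthy peers split the control-plane load instead of one peer being deterministically re-selected.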

2. Filter peers by BlockRangeUpdate.range_to

Before selecting a peer for update_pivot or header requests, filter out peers whose last BlockRangeUpdate.range_to < requested_block_number. The data is already received and stored — just not used in selection.
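A sketch of the filter, assuming a hypothetical field (last_advertised_range_to) holding the already-stored BlockRangeUpdate data; the point is only that selection consults it:

// Hypothetical shape: the peer table already stores the last advertised
// BlockRangeUpdate per peer; it just isn't consulted during selection.
struct PeerInfo {
    id: u64,
    // None if the peer never sent a BlockRangeUpdate.
    last_advertised_range_to: Option<u64>,
}

// Keep only peers that advertised a chain tip at or past the block we are
// about to request. Peers with no advertisement are kept, since the absence
// of an advertisement is not evidence they lack the block.
fn eligible_for_block(peers: &[PeerInfo], requested_block: u64) -> Vec<&PeerInfo> {
    peers
        .iter()
        .filter(|p| {
            p.last_advertised_range_to
                .map_or(true, |tip| tip >= requested_block)
        })
        .collect()
}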

3. Broaden peer rotation

get_best_peer must not return the same peer after repeated failures. Options (combinable; a sketch follows the list):

  • Track which peers were already tried this cycle and exclude them
  • After N consecutive failures on a peer, blacklist it for this call
  • On small networks, ensure every eligible peer is tried before declaring failure
  • Add jitter/randomization to break deterministic .max_by_key ties
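A sketch combining the first and third bullets (tried-set exclusion plus exhaustive coverage); the peer representation and weight are placeholders, not the real types:

use std::collections::HashSet;

// Hypothetical rotation wrapper: exclude peers already tried this cycle.
struct PeerRotation {
    tried: HashSet<u64>,
}

impl PeerRotation {
    fn new() -> Self {
        Self { tried: HashSet::new() }
    }

    // Pick the best not-yet-tried peer from (id, weight) pairs. On small
    // networks this guarantees every eligible peer is attempted before the
    // cycle can declare failure.
    fn next_peer<'a>(&mut self, peers: &'a [(u64, i64)]) -> Option<&'a (u64, i64)> {
        let pick = peers
            .iter()
            .filter(|(id, _)| !self.tried.contains(id))
            .max_by_key(|(_, weight)| *weight)?;
        self.tried.insert(pick.0);
        Some(pick)
    }
}

Once next_peer returns None, every eligible peer has been tried this cycle, which is the earliest point at which declaring failure is justified.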

4. Don't count passive waits as failures

Only increment total_failures on actual peer requests, not on "waiting for a different peer" cycles. The current behavior burns 40–73% of the failure budget without doing useful work.
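A sketch of the accounting split, with hypothetical names; only real request failures consume the budget that can escalate to "irrecoverable":

enum Attempt {
    RealRequestFailed,
    WaitingForDifferentPeer,
}

struct FailureBudget {
    real_failures: u32,
    passive_waits: u32,
}

impl FailureBudget {
    fn record(&mut self, attempt: Attempt) {
        match attempt {
            Attempt::RealRequestFailed => self.real_failures += 1,
            // Waiting is not evidence the network can't serve us; track it
            // separately instead of burning the failure budget.
            Attempt::WaitingForDifferentPeer => self.passive_waits += 1,
        }
    }

    fn exhausted(&self, max_total_failures: u32) -> bool {
        self.real_failures >= max_total_failures
    }
}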

5. Fix post-pivot header fetch

The header fetch after update_pivot succeeds uses the same deterministic peer selection. Wrap in a retry loop with peer rotation (overlaps with #3). For the case where the new pivot hasn't propagated to most peers yet: consider choosing a slightly older pivot (e.g., latest - 64 instead of latest).
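A one-function sketch of the older-pivot option (the 64-block offset is the value suggested above, not a tuned constant):

// Stay 64 blocks behind head so most peers have already imported the
// pivot by the time we request its headers.
fn choose_pivot(latest_block: u64) -> u64 {
    latest_block.saturating_sub(64)
}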

6. Reclassify pivot-update failures as recoverable

update_pivot returning PeerHandlerError::BlockHeaders maps to SyncError::PeerHandler → classified as irrecoverable around sync.rs:258-259 → process::exit(2). This should be retryable, not fatal. With fixes #1–#4 landing, the failure becomes much rarer, but when it does happen the sync should loop back and try again after a cooldown.
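A control-flow sketch of the recoverable classification (the outcome type and the 30s cooldown are illustrative):

use std::time::Duration;

// Hypothetical stand-in for the sync result enum.
enum SyncOutcome {
    Done,
    PivotUpdateFailed,
}

// Treat pivot-update failure as retryable with a cooldown rather than
// exiting the process; a fresh cycle picks a fresh pivot and peer set.
fn sync_loop(mut try_sync: impl FnMut() -> SyncOutcome) {
    loop {
        match try_sync() {
            SyncOutcome::Done => break,
            SyncOutcome::PivotUpdateFailed => {
                std::thread::sleep(Duration::from_secs(30));
            }
        }
    }
}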

7. Raise or rethink MAX_TOTAL_FAILURES = 15

Less critical once #1–#4 land. The cap is low enough that 15 failures can be exhausted in 2–3 minutes. Options: raise it to 50+, make it per-peer-cycle instead of global, or remove it entirely if the irrecoverable classification is fixed (#6).

8. Graceful shutdown / DB resilience on sync failure

process::exit(2) leaves the DB in an inconsistent state. Either:

  • Clean up temp sync state before exiting (flush/remove incomplete data)
  • Make startup resilient to partial sync state (detect + resume, or detect + clean automatically)

This is independent of the peer selection bugs but compounds the impact — every crash currently requires a full resync.
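A sketch of the first option, with a stand-in Db type (the real handle and methods would differ): flag and flush before any deliberate exit, because process::exit skips destructors.

// Minimal stand-in for the real database handle (hypothetical methods).
struct Db;

impl Db {
    fn mark_sync_incomplete(&mut self) { /* persist a "sync dirty" flag */ }
    fn flush(&mut self) { /* fsync pending writes */ }
}

// Any deliberate exit flushes and flags state first. On the next start,
// the flag lets the node detect partial sync state and resume or clean it
// automatically instead of demanding a manual `ethrex removedb`.
fn fatal_sync_exit(db: &mut Db, code: i32) -> ! {
    db.mark_sync_incomplete();
    db.flush();
    std::process::exit(code);
}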

Related work

Commits on fullsync-acceleration / fullsync-improvement branches address some of these:

  • 880244afe — weighted peer selection, record failure/success on header download, more fetch attempts, faster timeout
  • efaa344d4 — reset failure counter on success (fail only on consecutive failures, not total)

These should be evaluated for cherry-picking or porting.

Debugging tools

PR #6470 (feat(l1): add snap sync observability endpoints and tooling) adds tooling specifically designed to debug this class of issue:

  • admin_syncStatus RPC endpoint — live sync phase, pivot block, progress metrics
  • admin_peerScores RPC endpoint — per-peer scores, inflight request counts, capabilities
  • tooling/sync/peer_top.py — live TUI showing peer scores, request distribution, and selection patterns in real time
  • Grafana dashboard panels for sync progress, peer scoring distribution, and request rates
  • Header-download diagnostics logging in snap_sync.rs
  • Degradation detection with automatic TRACE escalation

These tools were used to produce the forensic analysis in this issue and should be merged first to enable verification of fixes.

Why it fails ~20% of the time (not always)

The MAX_TOTAL_FAILURES = 15 irrecoverable classification was likely introduced as a regression by #6394 (fix(l1): p2p sync stall fixes and discovery hardening, merged 2026-03-25). Before that PR, update_pivot used an infinite retry loop with 1s backoff. #6394 changed it to a finite failure budget with exponential backoff — the intent was to prevent infinite stalls, but the low cap combined with Bugs B and C turned transient failures into fatal crashes.

Bug B (deterministic peer selection) and Bug C (unused BlockRangeUpdate data) predate #6394, but were masked because the infinite retry loop eventually found a working peer by exhaustion. The regression in #6394 made the retry budget finite without fixing the underlying peer selection issues, turning a slow-but-working path into a crash.

Runs where all phases complete fast enough never need a pivot update → never hit the buggy path. Slower runs (larger state, slower peers, network congestion) trigger pivot staleness → update_pivot → the bugs.
