Motivation
Snap sync fails ~20% of the time on mainnet and deterministically on small networks (hoodi) due to bugs in the pivot-update peer selection path. The failure mode is always the same: update_pivot exhausts its 15-failure budget by repeatedly asking the same 1–2 peers, then exits the process as "irrecoverable." The crash also leaves the DB in an inconsistent state, requiring a full removedb and resync from scratch.
Three independent bugs compound to produce this:
Bug A — weight_peer systematically selects the wrong peer
The weight function used by get_best_peer is:
```rust
fn weight_peer(&self, score: &i64, requests: &i64) -> i64 {
    score * SCORE_WEIGHT - requests * REQUESTS_WEIGHT // = score - inflight_requests
}
```
During snap sync, healthy peers accumulate 40–50 inflight snap requests each, tanking their weight to ~3. An idle peer with 0 requests (e.g., an eth/70-only erigon node that doesn't support snap) gets weight ~47 and is selected every time — even though it can't answer the query.
Confirmed on mainnet run run_20260414_011127 (53 peers). The stuck peer 0x32ea…14be was erigon/v3.5.0-dev with eth/70 only (no snap, no eth/68/69). Score 47, requests 0, weight 47. Eleven healthy Geth peers had score 50 but requests ~47 each → weight 3. get_best_peer returned the idle erigon peer every time.
This bug gets WORSE on larger networks — more peers means more total inflight requests per peer, making the weight gap between idle and busy peers even larger.
The core issue is that inflight snap data requests and control-plane requests (pivot update, header resolution) share the same weight function. Snap inflight count should not penalize a peer's eligibility for control-plane requests, but at the same time we need a mechanism to distribute control-plane requests across eligible peers rather than always picking the same one.
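A minimal reproduction of the inversion, assuming unit weights (`SCORE_WEIGHT = REQUESTS_WEIGHT = 1`, as the comment in `weight_peer` implies); the peer names and numbers mirror the mainnet run above, not actual codebase types:

```rust
// Illustrative peer record; fields mirror the values observed on mainnet.
struct Peer {
    id: &'static str,
    score: i64,
    inflight: i64,
}

// Simplified weight_peer with unit weights (assumption).
fn weight_peer(score: i64, inflight: i64) -> i64 {
    score - inflight
}

// Simplified get_best_peer: highest weight wins.
fn best_peer(peers: &[Peer]) -> &Peer {
    peers
        .iter()
        .max_by_key(|p| weight_peer(p.score, p.inflight))
        .expect("non-empty peer table")
}

fn main() {
    let peers = [
        Peer { id: "geth-healthy", score: 50, inflight: 47 }, // weight 3
        Peer { id: "erigon-idle", score: 47, inflight: 0 },   // weight 47
    ];
    // The idle, snap-incapable peer wins despite being unable to answer.
    assert_eq!(best_peer(&peers).id, "erigon-idle");
}
```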
Bug B — Deterministic peer selection never rotates
get_best_peer uses .max_by_key(weight_peer) which deterministically returns the same peer when weights are stable. Its "skip same peer" logic only alternates within the top-N. If the top-N peers don't have the requested block, every other peer in the table is never tried.
Confirmed on hoodi run run_20260411_033943 (13 peers). Only 2 distinct peers were tried across 9 real attempts. At least 6 other peers had BlockRangeUpdate.range_to >= 2593928 (the requested block) — visible in the logs during the retry window — and were never asked.
Two independent surfaces:
- `get_peer_connections` returns peers in `HashMap` iteration order (stable). `request_block_headers` does `take(MAX_PEERS_TO_ASK)` (5) and retries 3× → same 5 peers every retry.
- `get_best_peer` deterministically returns the top-scored peer. On small networks the alternation window is pathologically narrow.
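The rotation gap can be sketched as follows, with illustrative peer ids and weights (not from the codebase): when weights are stable, `max_by_key` is deterministic, so every retry asks the same peer.

```rust
// Simplified get_best_peer over (peer_id, weight) pairs.
fn get_best_peer(weights: &[(u32, i64)]) -> Option<u32> {
    weights.iter().max_by_key(|&&(_, w)| w).map(|&(id, _)| id)
}

fn main() {
    let weights = [(1u32, 47), (2, 3), (3, 3), (4, 2)];
    // Ten "retries", one winner: peers 2..4 are never asked.
    for _ in 0..10 {
        assert_eq!(get_best_peer(&weights), Some(1));
    }
}
```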
Bug C — Available data not used for peer filtering
Peers send BlockRangeUpdate messages advertising their chain tip. This data is received and stored but never used for peer selection. On hoodi, the stuck peer 0xeedf…df38 sent a BlockRangeUpdate with range_to=2593886 — 42 blocks behind the requested pivot 2593928. We kept asking it anyway.
Similarly, on mainnet, the erigon peer likely hadn't synced to the requested pivot block. Filtering by range_to >= requested_block would have eliminated both failures immediately.
Compounding factors
"Waiting for different peer" counts as failures
Between real request attempts, the retry loop increments total_failures for passive wait cycles ("peer X failed 3 times, waiting for a different peer"). On hoodi, 6 of 15 failures were passive waits. On mainnet, 11 of 15. This burns 40–73% of the failure budget without making any actual requests.
Post-pivot header fetch has the same bugs
After `update_pivot` succeeds, `snap_sync.rs` calls `request_block_headers(block_number + 1, pivot.hash())`, which uses `ask_peer_head_number` → the same deterministic top-5 peers. Peers respond with `EmptyResponseFromPeer` (they don't have the just-updated pivot yet), the call returns `NoBlockHeaders`, which is classified as irrecoverable → `process::exit(2)`. This was the failure mode in mainnet run #7.
DB corruption on crash
process::exit(2) skips cleanup. On restart: Error: Unknown state found in DB. Please run 'ethrex removedb'. Every sync crash requires a full resync from scratch.
Evidence
Hoodi failure (run_20260411_033943)
13 peers total. Failed requesting pivot block 2593928.
| Time | Peer | Score change | total_failures |
|---|---|---|---|
| 03:50:51–03:50:55 | 0xeedf…df38 (Nimbus) | 0→−3 | 0→2 |
| 03:50:59–03:52:11 | (80s waiting, no real requests) | — | 2→8 |
| 03:52:11–03:52:14 | 0x501b…7bc8 (Nethermind) | 0→−3 | 8→10 |
| 03:52:17 | Nethermind disconnects (TooManyPeers) | — | — |
| 03:52:18–03:52:21 | 0xeedf…df38 (back to same) | −3→−6 | 11→13 |
| 03:52:25 | process::exit(2) | — | — |
Only 2 distinct peers tried. 6+ peers with the requested block were never asked.
Additional context: 0 instances of our-side peer rejection, 1,818 instances of remote "Too Many Peers" disconnections. On hoodi we can't grow the peer pool beyond ~13 — most nodes are already full. This makes peer-selection breadth critical: the pool we have is all we'll get.
Mainnet failure (run_20260414_011127)
53 peers. Failed requesting pivot block 24874945. The stuck peer 0x32ea…14be (erigon, eth/70 only) had score 47, requests 0, weight 47. Eleven healthy Geth peers had score 50, requests ~47, weight 3.
3 real requests to erigon, 11 "waiting for different peer" cycles (~180s wasted). Two other peers had range_to=24874945 at 02:09 — 5 minutes before the crash.
After crash: Error: Unknown state found in DB. Please run 'ethrex removedb'.
Proposed fixes (priority order)
1. Don't let snap inflight count dominate control-plane peer selection
The weight function score - inflight_requests conflates data-plane load (snap fetches) with control-plane eligibility (pivot updates, header lookups). During snap sync, healthy peers carry 40–50 inflight requests, making them look worse than idle/incapable peers.
Options:
- Separate weight for control-plane requests that ignores (or caps) the snap inflight penalty, e.g. `weight = score - min(requests, 5)` so weight never drops below `score - 5`.
- Use `get_random_peer` for pivot updates — pivot updates are rare (~1 per sync cycle); random selection among eligibles avoids the determinism bug entirely.
- Weight by request success rate instead of raw inflight count — a peer handling 50 requests and succeeding is better than one handling 0.
Note: while snap inflight shouldn't penalize control-plane selection, we still need to distribute control-plane load across eligible peers rather than always converging on one. Any solution should ensure that among equally-eligible peers, requests are spread (randomization, round-robin, or similar).
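The capped-penalty option can be sketched in a few lines; the function name and the cap of 5 are illustrative, following the `score - min(requests, 5)` example above:

```rust
// Control-plane weight with a bounded inflight penalty (sketch).
fn control_plane_weight(score: i64, inflight: i64) -> i64 {
    score - inflight.min(5)
}

fn main() {
    // The penalty is bounded: 47 inflight and 5 inflight weigh the same.
    assert_eq!(control_plane_weight(50, 47), control_plane_weight(50, 5));
    // Weight never drops below score - 5, regardless of snap load.
    assert_eq!(control_plane_weight(50, 1000), 45);
}
```

This keeps data-plane load from swamping the ranking while still mildly preferring less-loaded peers; it should be combined with eligibility filtering (fix #2) so incapable peers don't win on score alone.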
2. Filter peers by BlockRangeUpdate.range_to
Before selecting a peer for update_pivot or header requests, filter out peers whose last BlockRangeUpdate.range_to < requested_block_number. The data is already received and stored — just not used in selection.
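A sketch of the filter, assuming each peer record keeps the `range_to` from its last `BlockRangeUpdate` (the `last_range_to` field name is an assumption, not the actual struct); the numbers are from the hoodi run:

```rust
// Illustrative peer record carrying the advertised chain tip.
struct PeerInfo {
    id: u32,
    last_range_to: u64,
}

// Keep only peers that advertise having the requested block.
fn eligible_peers(peers: &[PeerInfo], requested_block: u64) -> Vec<u32> {
    peers
        .iter()
        .filter(|p| p.last_range_to >= requested_block)
        .map(|p| p.id)
        .collect()
}

fn main() {
    let peers = [
        PeerInfo { id: 1, last_range_to: 2_593_886 }, // 42 blocks behind: excluded
        PeerInfo { id: 2, last_range_to: 2_593_940 }, // has the pivot: eligible
    ];
    assert_eq!(eligible_peers(&peers, 2_593_928), vec![2]);
}
```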
3. Broaden peer rotation
get_best_peer must not return the same peer after repeated failures. Options (combinable):
- Track which peers were already tried this cycle and exclude them
- After N consecutive failures on a peer, blacklist it for this call
- On small networks, ensure every eligible peer is tried before declaring failure
- Add jitter/randomization to break deterministic `.max_by_key` ties
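Combining the first and third options above can be sketched as a tried-set exclusion, so repeated failures walk through the whole table instead of converging on one peer (names are illustrative):

```rust
use std::collections::HashSet;

// Best-weighted peer that has not been tried this retry cycle (sketch).
fn next_untried_peer(weights: &[(u32, i64)], tried: &HashSet<u32>) -> Option<u32> {
    weights
        .iter()
        .filter(|&&(id, _)| !tried.contains(&id))
        .max_by_key(|&&(_, w)| w)
        .map(|&(id, _)| id)
}

fn main() {
    let weights = [(1u32, 47), (2, 3), (3, 2)];
    let mut tried = HashSet::new();
    let mut order = Vec::new();
    // Drain the table: every eligible peer is attempted exactly once,
    // best-first, before the cycle can declare failure.
    while let Some(id) = next_untried_peer(&weights, &tried) {
        tried.insert(id);
        order.push(id);
    }
    assert_eq!(order, vec![1, 2, 3]);
}
```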
4. Don't count passive waits as failures
Only increment total_failures on actual peer requests, not on "waiting for a different peer" cycles. The current behavior burns 40–73% of the failure budget without doing useful work.
5. Fix post-pivot header fetch
The header fetch after update_pivot succeeds uses the same deterministic peer selection. Wrap in a retry loop with peer rotation (overlaps with #3). For the case where the new pivot hasn't propagated to most peers yet: consider choosing a slightly older pivot (e.g., latest - 64 instead of latest).
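The older-pivot idea amounts to backing off from the tip so most peers have already seen the chosen block. A sketch, with the offset of 64 taken from the example above and the constant/function names being illustrative:

```rust
// Distance behind the tip to place the pivot (64 follows the text's example).
const PIVOT_SAFETY_OFFSET: u64 = 64;

// Pick a pivot most peers should already have; saturating_sub guards
// chains younger than the offset.
fn choose_pivot(latest_block: u64) -> u64 {
    latest_block.saturating_sub(PIVOT_SAFETY_OFFSET)
}

fn main() {
    assert_eq!(choose_pivot(24_874_945), 24_874_881);
    assert_eq!(choose_pivot(10), 0); // early chain: clamp to genesis
}
```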
6. Reclassify pivot-update failures as recoverable
update_pivot returning PeerHandlerError::BlockHeaders maps to SyncError::PeerHandler → classified as irrecoverable around sync.rs:258-259 → process::exit(2). This should be retryable, not fatal. With fixes #1–#4 landing, the failure becomes much rarer, but when it does happen it should loop back and try again after a cooldown.
7. Raise or rethink MAX_TOTAL_FAILURES = 15
Less critical once #1–#4 land. The cap is low enough that 15 failures can be exhausted in 2–3 minutes. Options: raise to 50+, make it per-peer-cycle instead of global, or remove entirely if irrecoverable classification is fixed (#6).
8. Graceful shutdown / DB resilience on sync failure
process::exit(2) leaves the DB in an inconsistent state. Either:
- Clean up temp sync state before exiting (flush/remove incomplete data)
- Make startup resilient to partial sync state (detect + resume, or detect + clean automatically)
This is independent of the peer selection bugs but compounds the impact — every crash currently requires a full resync.
Related work
Commits on fullsync-acceleration / fullsync-improvement branches address some of these:
- `880244afe` — weighted peer selection, record failure/success on header download, more fetch attempts, faster timeout
- `efaa344d4` — reset failure counter on success (fail only on consecutive failures, not total)
These should be evaluated for cherry-picking or porting.
Debugging tools
PR #6470 (feat(l1): add snap sync observability endpoints and tooling) adds tooling specifically designed to debug this class of issue:
- `admin_syncStatus` RPC endpoint — live sync phase, pivot block, progress metrics
- `admin_peerScores` RPC endpoint — per-peer scores, inflight request counts, capabilities
- `tooling/sync/peer_top.py` — live TUI showing peer scores, request distribution, and selection patterns in real time
- Grafana dashboard panels for sync progress, peer scoring distribution, and request rates
- Header-download diagnostics logging in `snap_sync.rs`
- Degradation detection with automatic TRACE escalation
These tools were used to produce the forensic analysis in this issue and should be merged first to enable verification of fixes.
Why it fails ~20% of the time (not always)
The finite failure budget (`MAX_TOTAL_FAILURES = 15` with irrecoverable classification) was likely introduced as a regression by #6394 (fix(l1): p2p sync stall fixes and discovery hardening, merged 2026-03-25). Before that PR, update_pivot used an infinite retry loop with 1s backoff. #6394 changed it to a finite failure budget with exponential backoff — the intent was to prevent infinite stalls, but the low cap combined with Bugs B and C turned transient failures into fatal crashes.
Bug B (deterministic peer selection) and Bug C (unused BlockRangeUpdate data) predate #6394, but were masked because the infinite retry loop eventually found a working peer by exhaustion. The regression in #6394 made the retry budget finite without fixing the underlying peer selection issues, turning a slow-but-working path into a crash.
Runs where all phases complete fast enough never need a pivot update → never hit the buggy path. Slower runs (larger state, slower peers, network congestion) trigger pivot staleness → update_pivot → the bugs.