
fix(l1): pre-check sync head height to skip snap on low-block networks #6536

Open

edg-l wants to merge 1 commit into main from fix/snap-sync-low-block-network

Conversation

@edg-l
Contributor

edg-l commented Apr 27, 2026

Summary

  • When joining a fresh network whose head is only a few blocks deep, snap sync stalls in the header download loop before ever reaching the existing head_close_to_0 fallback in sync_cycle_snap (peers are barely synced themselves and request_block_headers returns no usable batch).
  • Adds an upfront probe in Syncer::sync_cycle: fetch the sync head's header by hash; if header.number < MIN_FULL_BLOCKS, flip snap_enabled to false and dispatch directly to sync_cycle_full (sketched after this list).
  • If the probe can't reach a peer after 3 attempts (2s apart), falls through to the existing snap path unchanged — no behavior regression on healthy mainnet/testnet.
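
A minimal standalone sketch of that decision, assuming MIN_FULL_BLOCKS is the 10 000-block threshold cited in the reviews below; the real Syncer, PeerHandler, and sync_cycle_* signatures in ethrex differ:

```rust
const MIN_FULL_BLOCKS: u64 = 10_000;

#[derive(Debug, PartialEq)]
enum SyncMode {
    Full,
    Snap,
}

/// Hypothetical pure form of the pre-check: `probed_head_number` is the
/// block number the probe recovered for the sync head, or `None` if no
/// peer answered within the allotted attempts.
fn choose_sync_mode(probed_head_number: Option<u64>) -> SyncMode {
    match probed_head_number {
        // Head is only a few blocks deep: snap sync would stall, go full.
        Some(n) if n < MIN_FULL_BLOCKS => SyncMode::Full,
        // Healthy chain, or probe failure: keep the existing snap path.
        _ => SyncMode::Snap,
    }
}

fn main() {
    assert_eq!(choose_sync_mode(Some(3)), SyncMode::Full);
    assert_eq!(choose_sync_mode(Some(1_000_000)), SyncMode::Snap);
    assert_eq!(choose_sync_mode(None), SyncMode::Snap);
}
```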

Test plan

  • Bring up a fresh local devnet of multiple ethrex nodes; verify a newly started node skips snap and full-syncs from genesis without stalling.
  • Confirm normal mainnet/testnet snap sync is unaffected (probe returns >= MIN_FULL_BLOCKS, snap path runs as before).
  • Confirm probe-failure case (no peers): snap path runs as before, no immediate fallback to full.

When joining a fresh network whose head is only a few blocks deep, snap
sync would stall in the header download loop before reaching the
existing `head_close_to_0` fallback. Probe the sync head's header
up-front; if its number is below MIN_FULL_BLOCKS, switch to full sync
immediately. Falls through to the existing snap path if the probe can't
reach any peer.
edg-l requested a review from a team as a code owner April 27, 2026 14:21
github-actions bot added the L1 Ethereum client label Apr 27, 2026
@github-actions

🤖 Kimi Code Review

This PR adds a pre-check to avoid snap sync on fresh devnets where the chain head is below MIN_FULL_BLOCKS. The implementation is correct and handles the edge case properly.

Highlights

  • Correct fallback semantics: Returning None from probe_sync_head_number when peers don't respond (lines 243-245, 269) correctly falls through to snap sync, delegating to the existing (potentially slower) in-loop guard rather than stalling indefinitely.
  • Bounded retry logic: The 3-attempt limit with 2-second delays (lines 236-237) prevents indefinite hangs while tolerating transient network issues (see the retry sketch after this list).
  • Proper structured logging: Use of info! with field shorthand (line 218) and appropriate log levels (debug for probes, warn for errors).
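
A self-contained sketch of that bounded-retry shape, assuming a tokio runtime as used elsewhere in ethrex; the names and the generic closure are placeholders, not the ethrex PeerHandler API:

```rust
use std::future::Future;
use std::time::Duration;

const PROBE_SYNC_HEAD_ATTEMPTS: u32 = 3;
const PROBE_RETRY_DELAY: Duration = Duration::from_secs(2);

/// Run the probe up to PROBE_SYNC_HEAD_ATTEMPTS times, sleeping between
/// attempts but never after the final one, and give up with None so the
/// caller can fall through to the existing snap path.
async fn probe_with_retries<F, Fut>(mut probe_once: F) -> Option<u64>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Option<u64>>,
{
    for attempt in 1..=PROBE_SYNC_HEAD_ATTEMPTS {
        if let Some(number) = probe_once().await {
            return Some(number);
        }
        // Only sleep if another attempt will follow.
        if attempt < PROBE_SYNC_HEAD_ATTEMPTS {
            tokio::time::sleep(PROBE_RETRY_DELAY).await;
        }
    }
    None
}

#[tokio::main]
async fn main() {
    // A probe that never reaches a peer: returns None after two 2 s sleeps.
    let result = probe_with_retries(|| async { Option::<u64>::None }).await;
    assert_eq!(result, None);
}
```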

Minor suggestions

  1. Line 256-257: Consider if request_block_headers_from_hash could be replaced with a single-header request API (if available) to avoid iterating through a batch response, though the current find logic is correct.

  2. Line 262, 265: For consistency with line 218, consider using structured fields instead of inline formatting:

    debug!(
        attempt,
        max_attempts = PROBE_SYNC_HEAD_ATTEMPTS,
        "Sync head probe: no peer response"
    );
  3. Documentation: The comment on line 207 references head_close_to_0—ensure this internal detail remains synchronized if the snap sync module is refactored later.

Security/Performance: No concerns. The probe adds minimal latency (max 6s) and the early full-sync return avoids expensive snap sync setup for short chains.

LGTM.


Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@greptile-apps

greptile-apps Bot commented Apr 27, 2026

Greptile Summary

This PR adds an upfront probe in Syncer::sync_cycle that fetches the sync head's block header by hash before committing to snap sync. If the block number is below MIN_FULL_BLOCKS (10 000), snap_enabled is permanently set to false and the node dispatches directly to full::sync_cycle_full, bypassing the stall that previously occurred when peers on a fresh devnet couldn't serve a full header batch. The probe retries up to 3 times (2 s apart) and falls through to the original snap path on failure, so there is no regression on healthy mainnet/testnet nodes.

Confidence Score: 4/5

Safe to merge — fix is correct and consistent with existing snap/full transition logic; only P2 style notes remain.

No P0 or P1 issues found. The probe logic is sound: the header hash is cryptographically bound so a single-peer response is trustworthy, single-element header responses pass the chaining check (vacuous windows(2).all), and the fallback on probe failure is conservative. The two P2 comments are minor — one is a missing attempt number in a debug log, the other is a documentation gap around the permanent nature of the snap-disable decision.
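
A tiny standalone illustration of the "vacuous windows(2).all" point: a single element yields no size-2 windows, so the chaining predicate is never evaluated and the check passes.

```rust
fn main() {
    let single_header_number = [42u64];
    // No windows of length 2 exist, so .all(..) is vacuously true,
    // mirroring why a single-header response passes the chaining check.
    assert!(single_header_number.windows(2).all(|w| w[0] + 1 == w[1]));
    println!("single-element response passes the chaining check");
}
```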

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| crates/networking/p2p/sync.rs | Adds a pre-check probe before snap sync that fetches the sync head header by hash; if block number < MIN_FULL_BLOCKS it permanently disables snap and dispatches directly to full sync. Logic is correct and consistent with existing snap_sync.rs behaviour; two minor style/documentation P2s noted. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[sync_cycle called] --> B{snap_enabled?}
    B -- No --> F[full::sync_cycle_full]
    B -- Yes --> C[probe_sync_head_number\nup to 3 attempts]
    C -- None / probe failed --> E[snap::sync_cycle_snap]
    C -- Some number --> D{number < MIN_FULL_BLOCKS\n10 000}
    D -- No --> E
    D -- Yes --> G[snap_enabled = false\nfull::sync_cycle_full]
    E --> H{head_close_to_0\nor sync_head found?}
    H -- Yes --> I[snap_enabled = false\nfull::sync_cycle_full]
    H -- No --> J[continue snap sync loop]

Comment on lines +274 to +278
Ok(Some(headers)) => {
    if let Some(header) = headers.iter().find(|h| h.hash() == sync_head) {
        return Some(header.number);
    }
    debug!("Sync head probe: response did not contain target header");

P2 Missing attempt number in debug log

When Ok(Some(headers)) is returned but the target header isn't found, the debug message doesn't include the attempt number, unlike the other two branches. In a scenario where all 3 attempts hit this case (e.g. a peer consistently returns a response anchored to a different fork root), the three identical log lines with no attempt context make it harder to diagnose.

Suggested change
  Ok(Some(headers)) => {
      if let Some(header) = headers.iter().find(|h| h.hash() == sync_head) {
          return Some(header.number);
      }
-     debug!("Sync head probe: response did not contain target header");
+     debug!("Sync head probe attempt {attempt}/{PROBE_SYNC_HEAD_ATTEMPTS}: response did not contain target header");

Comment on line +219

    sync_head_number,
    "Sync head below MIN_FULL_BLOCKS ({MIN_FULL_BLOCKS}), using full sync"
);
self.snap_enabled.store(false, Ordering::Relaxed);

P2 Permanent snap-disable on first eligible sync_head

snap_enabled is set to false here before dispatching to full sync, and it is never reset to true. If the node is started while the devnet chain is short (< MIN_FULL_BLOCKS) but then the network advances past that threshold before the first full sync cycle completes, subsequent sync_cycle calls will still skip the snap path entirely.

This matches the existing behaviour in snap_sync.rs (line 225 sets the same flag permanently on head_close_to_0), so it is consistent — but worth documenting explicitly here since the pre-check fires much earlier and at a point where the "permanent" decision may be harder to notice.


@github-actions

Lines of code report

Total lines added: 46
Total lines removed: 0
Total lines changed: 46

Detailed view
+--------------------------------------+-------+------+
| File                                 | Lines | Diff |
+--------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync.rs | 358   | +46  |
+--------------------------------------+-------+------+

@github-actions

🤖 Claude Code Review

PR #6536 Review: fix(l1): pre-check sync head height to skip snap on low-block networks

Overall: The fix is well-motivated and targets a real stall condition. The implementation is mostly correct. A few things worth addressing.


Correctness

Permanent snap_enabled = false on short chains

sync.rs:219: self.snap_enabled.store(false, Ordering::Relaxed) is a permanent write to a shared Arc<AtomicBool>. There is no re-enable path — once this fires, every subsequent sync_cycle call goes to the full-sync branch for the lifetime of the Syncer. For a devnet that stays short this is fine, but if a node is started while the chain is only a few blocks deep and the network later grows past MIN_FULL_BLOCKS, it stays locked in full sync. This matches the behavior of the existing head_close_to_0 guard in snap_sync.rs, so it's consistent — but the PR description should mention this as intentional.

find() iterates and hashes all returned headers

sync.rs:274:

if let Some(header) = headers.iter().find(|h| h.hash() == sync_head) {

h.hash() runs a keccak256 per element. Since the request is made with NewToOld ordering starting from sync_head, the first header in the response should be sync_head. Iterating the whole batch (which may be hundreds of headers by default) and hashing each one wastes CPU. Prefer:

if let Some(header) = headers.first().filter(|h| h.hash() == sync_head) {
    return Some(header.number);
}

This keeps the security check (validates peer didn't send a wrong chain) while touching only one hash.

Full batch request for a single-header probe

request_block_headers_from_hash sends a standard GetBlockHeaders RLP message, which returns a full batch of headers (potentially up to the peer's limit). The probe only needs one header. If PeerHandler doesn't expose a single-header variant, this is acceptable as-is, but it's worth a comment explaining why a batch function is used for a probe.


Rust Idioms

if let chain syntax (MSRV)

sync.rs:209:

if let Some(sync_head_number) = probe_sync_head_number(&mut self.peers, sync_head).await
    && sync_head_number < MIN_FULL_BLOCKS

let chains (if let … && condition) were stabilized in Rust 1.88.0 (May 2025). Confirm the project's MSRV is ≥ 1.88, or there will be a compiler error on older toolchains.
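
If the MSRV turns out to be below 1.88, the same check can be written with nested if-lets. A toy standalone equivalent, where probe() and the constant stand in for the real async probe and threshold:

```rust
const MIN_FULL_BLOCKS: u64 = 10_000;

fn probe() -> Option<u64> {
    Some(3)
}

fn main() {
    // Pre-1.88 equivalent of `if let Some(n) = probe() && n < MIN_FULL_BLOCKS`:
    if let Some(n) = probe() {
        if n < MIN_FULL_BLOCKS {
            println!("sync head {n} is below MIN_FULL_BLOCKS, switch to full sync");
        }
    }
}
```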


Minor Issues

Sleep on last failed attempt is correctly avoided: the if attempt < PROBE_SYNC_HEAD_ATTEMPTS guard takes care of it. Good.

warn! vs debug! log level on probe failure — the Err branch uses warn!, which is appropriate for unexpected network errors. The Ok(None) branch uses debug!, which is correct for an expected "no peer available" case. Consistent.

METRICS.enable() not called before early full-sync dispatch (sync.rs:218–225): When the probe triggers the full-sync path, METRICS.enable() (called at sync.rs:222 in the normal snap path) is skipped. The existing head_close_to_0 guard in snap_sync.rs calls METRICS.disable() before handing off, implying metrics was already enabled at that point. Here metrics was never enabled, so no disable is needed — but verify that sync_cycle_full doesn't assume metrics are in a particular state.

No automated tests — the test plan is entirely manual. Given the async/network nature this is understandable, but a unit test that mocks PeerHandler to verify the pre-check logic (below threshold → full sync, above threshold → snap, no response → snap) would catch regressions.
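
A hedged sketch of what such a test could look like if the threshold decision were factored behind a small probe trait; HeadProbe, MockPeers, and should_skip_snap are hypothetical names, not the ethrex API:

```rust
const MIN_FULL_BLOCKS: u64 = 10_000;

trait HeadProbe {
    /// Returns the sync head's block number, or None if no peer answered.
    fn probe_head_number(&mut self) -> Option<u64>;
}

/// true = skip snap and fall back to full sync; false = keep the snap path.
fn should_skip_snap(probe: &mut impl HeadProbe) -> bool {
    matches!(probe.probe_head_number(), Some(n) if n < MIN_FULL_BLOCKS)
}

#[cfg(test)]
mod tests {
    use super::*;

    struct MockPeers(Option<u64>);

    impl HeadProbe for MockPeers {
        fn probe_head_number(&mut self) -> Option<u64> {
            self.0
        }
    }

    #[test]
    fn below_threshold_falls_back_to_full_sync() {
        assert!(should_skip_snap(&mut MockPeers(Some(3))));
    }

    #[test]
    fn healthy_chain_keeps_snap_sync() {
        assert!(!should_skip_snap(&mut MockPeers(Some(MIN_FULL_BLOCKS))));
    }

    #[test]
    fn probe_failure_keeps_snap_sync() {
        assert!(!should_skip_snap(&mut MockPeers(None)));
    }
}
```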


Summary

The fix is correct in its primary goal and the fallback behavior (no-peer → snap path unchanged) is safe. Address the find()-vs-first() hashing inefficiency and confirm the MSRV supports let chains. Everything else is minor polish.


Automated review by Claude (Anthropic) · sonnet · custom prompt

@github-actions

🤖 Codex Code Review

Findings

  1. sync.rs:219: this new early-return path disables snap and jumps straight into full sync without clearing any persisted snap checkpoint. SyncManager keeps looping while get_header_download_checkpoint() is still set (sync_manager.rs:223), and the only cleanup hook for that state is clear_snap_state() (store.rs:834). If this branch is hit while resuming a partial snap sync, the node can keep re-entering full sync with a stale checkpoint and never satisfy the manager’s exit condition. This branch should clear snap state before returning, or the manager loop should also stop once snap has been disabled. A toy sketch of that ordering follows the findings list.

  2. sync.rs:268: probe_sync_head_number uses request_block_headers_from_hash(), which sends GetBlockHeaders with limit: BLOCK_HEADER_LIMIT (peer_handler.rs:437, blocks.rs:84) and retries it 3 times. With PEER_REPLY_TIMEOUT = 5s, this can add about 19s before the existing snap-sync retry loop even starts, and it does so just to learn one block number. There is already a single-header path in peer_handler.rs:51; reusing/exposing that would avoid the extra payload and peer-scoring churn on every snap attempt.
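
For finding 1, a toy, self-contained sketch of the ordering being asked for; the Store type and clear_snap_state body below are stand-ins, not the ethrex implementations:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Toy stand-in for the parts of the store the finding mentions.
struct Store {
    header_download_checkpoint: Option<u64>,
}

impl Store {
    fn clear_snap_state(&mut self) {
        // The real clear_snap_state also drops other snap-sync progress;
        // only the checkpoint that keeps SyncManager looping is modeled here.
        self.header_download_checkpoint = None;
    }
}

/// Disable snap, clear the stale checkpoint, then hand off to full sync,
/// so the manager's exit condition can actually be satisfied.
fn switch_to_full_sync(snap_enabled: &AtomicBool, store: &mut Store) {
    snap_enabled.store(false, Ordering::Relaxed);
    store.clear_snap_state();
    // full::sync_cycle_full(..) would run from here in the real code path.
}

fn main() {
    let snap_enabled = AtomicBool::new(true);
    let mut store = Store { header_download_checkpoint: Some(1_234) };
    switch_to_full_sync(&snap_enabled, &mut store);
    assert!(!snap_enabled.load(Ordering::Relaxed));
    assert!(store.header_download_checkpoint.is_none());
}
```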

Otherwise the change is small and I don’t see EVM, gas-accounting, trie, or consensus-rule issues in the touched logic.

I couldn’t run cargo check here because the sandboxed environment blocks rustup from writing under /home/runner/.rustup.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt
