
fix(l1): fix storage range fetching for big accounts #6553

Open

iovoid wants to merge 3 commits into main from fix/snap-schedule-big-accounts

Conversation

iovoid (Contributor) commented Apr 30, 2026

Motivation

Storage range download would get stuck on large networks with small pivot times.

Description

We always queued bulk tasks over [0, MAX] regardless of an account's existing intervals, relying on the response handler's big-account split path to re-queue the remaining work on each call. That path only fires when a peer's response is truncated, so when a peer fully covered a known-big account in one response, no branch ran and the account's intervals stayed in accounts_with_storage_root forever.

@iovoid iovoid requested a review from a team as a code owner April 30, 2026 15:36
@github-actions github-actions Bot added the L1 Ethereum client label Apr 30, 2026
@github-actions

Lines of code report

Total lines added: 72
Total lines removed: 0
Total lines changed: 72

Detailed view
+---------------------------------------------+-------+------+
| File                                        | Lines | Diff |
+---------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/snap/client.rs | 1238  | +72  |
+---------------------------------------------+-------+------+

@github-actions

🤖 Claude Code Review



PR #6553: fix(l1): fix storage range fetching for big accounts

Overview

The PR fixes a real correctness bug in snap sync: big accounts (those with existing interval tracking) were permanently stuck in accounts_with_storage_root because the task scheduler queued all accounts from [0, MAX] unconditionally. The "big account split" path in the response handler only fires when a peer truncates its response; if a peer fully covered a big account in one response, the split path never ran, intervals never drained, and the account was stuck forever.

The fix partitions scheduling into two explicit paths:

  • Bulk tasks for fresh accounts with empty intervals (batched as before).
  • Per-interval tasks for big accounts, one task per interval, and a new completion path that removes each interval after it's served.
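
In outline, the new scheduling looks like this (a simplified sketch, not the PR's exact code; `queue_bulk_task` and `queue_interval_task` are hypothetical stand-ins for the crate's task-queueing machinery):

```rust
// Sketch: partition scheduling into a bulk path and a per-interval path.
for (root, accounts) in &accounts_by_root_hash {
    let first_account = accounts.first().expect("non-empty account group");
    let intervals = &account_storage_roots
        .accounts_with_storage_root
        .get(first_account)
        .expect("account is tracked")
        .1;
    if intervals.is_empty() {
        // Fresh account: one bulk task over the full key space, batched as before.
        queue_bulk_task(root, accounts, H256::zero(), H256::repeat_byte(0xff));
    } else {
        // Known-big account: one task per recorded interval, so a fully
        // covering response can drain that interval on completion.
        for (start, end) in intervals {
            queue_interval_task(root, accounts, *start, *end);
        }
    }
}
```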

The approach is sound. A few issues are worth addressing:


Correctness Issues

1. Linear scan picks the last matching account, not the first

crates/networking/p2p/snap/client.rs, new diff lines 94–101:

let mut acc_hash: H256 = H256::zero();
for account in accounts_by_root_hash[start_index].1.iter() {
    if let Some((_, old_intervals)) = account_storage_roots
        .accounts_with_storage_root
        .get(account)
        && !old_intervals.is_empty()
    {
        acc_hash = *account;   // no break — overwrites on every match
    }
}

Because there is no break, if the group at start_index contains multiple accounts with non-empty intervals (possible when multiple accounts share the same storage root), acc_hash ends up being the last matching account rather than the first. The subsequent get_mut + remove then only cleans up that one account's entry, while the others retain the stale interval. A find-based approach is cleaner and makes the intent explicit:

let acc_hash = accounts_by_root_hash[start_index].1.iter().find(|account| {
    account_storage_roots
        .accounts_with_storage_root
        .get(*account)
        .map_or(false, |(_, ivs)| !ivs.is_empty())
}).copied();

2. Interval matched by end hash only

New diff line 113:

.position(|(_old_start, end)| end == &hash_end)

The _old_start is ignored. If two tracked intervals for the same account happen to share an end hash (edge case, but possible in pathological state), this silently removes the wrong interval. Since start_hash is available in the StorageTask currently being processed, matching on (start_hash, end_hash) would be strictly safer.

3. Interval cleanup inconsistency for shared-root account groups

When an interval is drained, the code removes it from acc_hash's entry (diff line 117) but marks all accounts in the group as done (diff lines 119–122). If every account in the group has its own separate entry in accounts_with_storage_root (which is the expected case — they all share the same storage root and therefore the same intervals), the other accounts' entries are never cleaned up. They stay with stale intervals in the map forever, even though those accounts are added to healed_accounts. This is a memory leak of tracking state. Either clean up all accounts' entries, or document/enforce the invariant that only first_account (the canonical representative) holds the interval list.
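
A minimal sketch of the first option (cleaning up every account's entry), assuming each account in the group carries its own copy of the interval list — variable names follow the diff, but this is not the PR's code:

```rust
// Sketch: drain the completed interval from every account sharing the root,
// not just the single acc_hash found by the scan in item 1.
for account in accounts_by_root_hash[start_index].1.iter() {
    if let Some((_, intervals)) = account_storage_roots
        .accounts_with_storage_root
        .get_mut(account)
    {
        // Matching on both endpoints also avoids the end-hash-only
        // ambiguity flagged in item 2.
        intervals.retain(|(start, end)| !(start == &hash_start && end == &hash_end));
    }
}
```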


Minor Issues

4. Two-level if/else if could be an else clause

Diff line 53:

} else {
    if let Some(start) = bulk_chunk_start {

This is a Clippy lint (clippy::collapsible_else_if) and can be written as } else if let Some(start) = bulk_chunk_start {.

5. Very long line in get_mut error

Diff line 110 has a 130+ character line for the ok_or message. Wrapping it to match the surrounding style (as is done for the other ok_or_else calls) would keep the file consistent.

6. InternalError used for invariant violations

SnapError::InternalError is returned for conditions like "empty accounts vector" and "interval not found", which are programmer invariants, not recoverable runtime errors. Using expect() or a debug_assert! before the operation would communicate that intent more clearly, though returning Err is the more conservative choice for a network-path function. This is a style preference.
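
For illustration, the invariant-first style would look like this (a sketch of the suggestion, not a requested change):

```rust
// Sketch: state the invariant up front in debug builds, while keeping the
// conservative error return on the network path.
debug_assert!(
    !accounts.is_empty(),
    "scheduling reached with an empty accounts vector"
);
let first_account = *accounts
    .first()
    .ok_or_else(|| SnapError::InternalError("Empty accounts vector".to_owned()))?;
```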


Missing Test Coverage

The new per-interval scheduling path and the interval completion/removal path have no unit tests. Given that the original bug was subtle (it only manifested when a peer's response exactly covered a big account without truncating), a regression test — even a minimal one with a mocked peer — would prevent future regressions and validate the accounting logic. This is the most impactful gap in the PR.
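
A regression test could take roughly this shape (a sketch only: `seed_intervals`, `MockPeer`, and the exact signature of `request_storage_ranges` are hypothetical stand-ins for whatever harness the crate provides):

```rust
#[tokio::test]
async fn fully_covered_big_account_drains_its_intervals() {
    // Hypothetical setup: one account already tracked as "big", i.e. it has
    // a persisted interval from a previous split.
    let account = H256::repeat_byte(0xab);
    let mut roots = seed_intervals(account, vec![(H256::zero(), H256::repeat_byte(0xff))]);

    // Hypothetical peer that serves the whole interval in a single,
    // non-truncated response — the exact case that used to get stuck.
    let peer = MockPeer::fully_covering();

    request_storage_ranges(&mut roots, peer).await.unwrap();

    // The interval must be drained and the account finalised, not left in
    // accounts_with_storage_root forever.
    assert!(!roots.accounts_with_storage_root.contains_key(&account));
}
```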


Summary

The root cause diagnosis is correct and the fix is logically sound. Items 1 and 2 above are the most important to address before merging: the wrong-account selection and end-hash-only matching could produce incorrect behavior in multi-account-per-root groups or edge cases with interval overlap. Item 3 (stale map entries) is lower severity but leaves the tracking state inconsistent. A regression test would significantly strengthen confidence in the fix.


Automated review by Claude (Anthropic) · sonnet · custom prompt

greptile-apps (Bot) commented Apr 30, 2026

Greptile Summary

This PR fixes a liveness bug in snap-sync storage downloading: accounts with large storage tries would get stuck because the per-interval re-queue path only fired on truncated peer responses, so a peer that covered the whole account in one reply left the intervals undrained forever. The fix partitions tasks at function entry — fresh accounts get bulk tasks, previously-split accounts get one task per recorded interval — and adds a completion handler that removes the finished interval and finalises the account group when all intervals are gone.

  • The partitioning loop keys the bulk-vs-per-interval decision on first_account, taken from a Vec built by iterating a HashMap (non-deterministic order). Because the split path stores intervals under the first_acc_hash determined in a different call, first_account can differ between calls for groups with more than one account sharing a storage root, silently rescheduling a big account as a bulk task and reproducing the original stuck behavior.
  • The completion handler matches an interval for removal by end_hash alone; matching (start_hash, end_hash) would be safer.

Confidence Score: 3/5

The fix resolves the described bug but introduces a latent regression path for accounts sharing a storage root due to non-deterministic HashMap iteration affecting interval lookup.

A P1 finding (non-deterministic first_account selection) can reproduce the original liveness bug for a specific account topology, capping confidence at 4; the additional P2 (end-hash-only matching) lowers it slightly further to 3.

crates/networking/p2p/snap/client.rs — partitioning loop (line ~571) and interval-removal handler (line ~944)

Important Files Changed

| Filename | Overview |
| --- | --- |
| crates/networking/p2p/snap/client.rs | Fixes stuck storage-range download for big accounts by partitioning tasks into bulk and per-interval paths at function entry; introduces an interval-removal handler for fully-covered per-interval tasks, but the partitioning relies on first_account (non-deterministic HashMap iteration order) to detect existing intervals, which can silently fall back to the buggy bulk path when multiple accounts share a storage root. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[request_storage_ranges called] --> B[Build accounts_by_root_hash from HashMap]
    B --> C{For each group: check first_account.intervals}
    C -- empty intervals --> D[Schedule bulk StorageTask start_hash=zero]
    C -- non-empty intervals --> E[Schedule per-interval StorageTasks]
    D --> F[Task dispatched to peer]
    E --> F
    F --> G{peer response: remaining_start < remaining_end?}
    G -- yes, partial --> H{hash_start.is_zero?}
    H -- yes bulk partial --> I[Re-queue remaining bulk chunk]
    H -- no, per-interval partial --> J[Update interval start_hash, re-queue]
    G -- no, fully covered --> K{hash_end.is_some?}
    K -- no, bulk complete --> L[Mark accounts done]
    K -- yes, per-interval complete NEW PATH --> M[Find acc with non-empty intervals in group]
    M --> N[Remove matching interval by end_hash]
    N --> O{intervals empty?}
    O -- yes --> P[Mark all group accounts done + healed]
    O -- no --> Q[More intervals remain, await other tasks]
    G -- no, fully covered & hash_start!=0 & no hash_end --> R[Big account detected: split into chunks, store intervals]
```

Reviews (1): Last reviewed commit: "snap: address review nits in scheduling ..."

Comment on lines +571 to +583
```rust
for (i, (_, accounts)) in accounts_by_root_hash.iter().enumerate() {
    let first_account = *accounts.first().ok_or_else(|| {
        SnapError::InternalError("Empty accounts vector while scheduling tasks".to_owned())
    })?;
    let intervals = &account_storage_roots
        .accounts_with_storage_root
        .get(&first_account)
        .ok_or_else(|| {
            SnapError::InternalError(
                "Could not find intervals for account while scheduling".to_owned(),
            )
        })?
        .1;
```

P1 Interval lookup keyed on non-deterministic first_account

The partitioning loop decides bulk vs. per-interval by looking up first_account.intervals, where first_account is accounts.first() from a Vec built by iterating account_storage_roots.accounts_with_storage_root — a HashMap whose iteration order is non-deterministic across calls.

The split path stores intervals under exactly one key, first_acc_hash = accounts_by_root_hash[remaining_start].1.first(), which was determined in a previous call's HashMap iteration order. In a subsequent call, accounts_by_root_hash[i].1.first() can land on a different account whose intervals are empty, causing the group to fall through to the bulk path and reproduce the original stuck behavior for accounts that share a storage root.
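
One concise resolution (a sketch, not the PR's code) is to make the representative deterministic — for example, always the smallest account hash in the group — and to apply the same rule on both the split path and the partitioning loop:

```rust
use ethereum_types::H256; // assuming the crate's usual H256 type

// Sketch: a deterministic canonical representative for a shared-root group.
// H256 is Ord, so the minimum hash is stable across calls regardless of the
// HashMap iteration order that built accounts_by_root_hash.
fn canonical_account(accounts: &[H256]) -> Option<H256> {
    accounts.iter().min().copied()
}
```

Alternatively, sorting each group's Vec once when accounts_by_root_hash is built gives the same stability without changing the lookup sites.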


Comment on lines +944 to +946
```rust
let pos = old_intervals
    .iter()
    .position(|(_old_start, end)| end == &hash_end)
```

P2 Interval matched by end-hash only

position(|(_old_start, end)| end == &hash_end) ignores start_hash when locating the completed interval. If two intervals in the same account happen to share the same end_hash (e.g. arithmetic overflow causes two adjacent chunks to land on the same ceiling value), this will remove the wrong interval and leave the other one dangling. Matching on both (start_hash, end_hash) would make the lookup unambiguous.

Suggested change

```diff
-let pos = old_intervals
-    .iter()
-    .position(|(_old_start, end)| end == &hash_end)
+let pos = old_intervals
+    .iter()
+    .position(|(old_start, end)| old_start == &hash_start && end == &hash_end)
+    .ok_or(SnapError::InternalError(
+        "Could not find an old interval that we were tracking".to_owned(),
+    ))?;
```

@github-actions

🤖 Codex Code Review

No blocking findings.

The new task partitioning around client.rs:561 and the explicit “completed interval” cleanup in client.rs:918 fit the existing big-account bookkeeping, and I don’t see a correctness, security, or consensus-risk regression in the changed logic.

Residual risk: I don’t see targeted regression coverage for the exact case this fixes. A focused test that starts with persisted old_intervals, reruns request_storage_ranges, and hits the remaining_start == remaining_end && hash_end.is_some() path would make this safer to maintain. I also couldn’t run cargo test here because the sandboxed toolchain failed creating temp files under /home/runner/.rustup.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt


Labels

L1 Ethereum client
