fix(l1): fix storage range fetching for big accounts #6553
🤖 Claude Code Review

Now I have everything I need. Here is my review:

PR #6553: fix(l1): fix storage range fetching for big accounts

Overview

The PR fixes a real correctness bug in snap sync: big accounts (those with existing interval tracking) were permanently stuck in `accounts_with_storage_root`. The fix partitions scheduling into two explicit paths: bulk tasks for fresh accounts, and one task per recorded interval for accounts that were already split.
The approach is sound. A few issues are worth addressing:

Correctness Issues

**1. Linear scan picks the last matching account, not the first**

```rust
let mut acc_hash: H256 = H256::zero();
for account in accounts_by_root_hash[start_index].1.iter() {
    if let Some((_, old_intervals)) = account_storage_roots
        .accounts_with_storage_root
        .get(account)
        && !old_intervals.is_empty()
    {
        acc_hash = *account; // no break — overwrites on every match
    }
}
```

Because there is no `break`, the loop keeps overwriting `acc_hash` and ends up with the last matching account in the group rather than the first. A `find` expresses the intended first-match lookup directly:

```rust
let acc_hash = accounts_by_root_hash[start_index].1.iter().find(|account| {
    account_storage_roots
        .accounts_with_storage_root
        .get(*account)
        .map_or(false, |(_, ivs)| !ivs.is_empty())
}).copied();
```

**2. Interval matched by end hash only**

New diff line 113:

```rust
.position(|(_old_start, end)| end == &hash_end)
```

The `position` call locates the completed interval by its end hash alone, ignoring `start_hash`; if two intervals ever share an end hash, the wrong one can be removed. Matching on both `(start_hash, end_hash)` would make the lookup unambiguous.

**3. Interval cleanup inconsistency for shared-root account groups**

When an interval is drained, the code removes it from the tracked account's entry, but other accounts in the same shared-root group can be left with stale map entries (see Summary below).

Minor Issues

**4. Two-level `else { if }` nesting**

Diff line 53:

```rust
} else {
    if let Some(start) = bulk_chunk_start {
```

This is a Clippy lint (`collapsible_else_if`); the nested `if` can be collapsed into `else if`.

**5. Very long line**

Diff line 110 has a 130+ character line.

**6. Missing Test Coverage**

The new per-interval scheduling path and the interval completion/removal path have no unit tests. Given that the original bug was subtle (it only manifested when a peer's response exactly covered a big account without truncating), a regression test — even a minimal one with a mocked peer — would prevent future regressions and validate the accounting logic. This is the most impactful gap in the PR.

Summary

The root cause diagnosis is correct and the fix is logically sound. Items 1 and 2 above are the most important to address before merging: the wrong-account selection and end-hash-only matching could produce incorrect behavior in multi-account-per-root groups or edge cases with interval overlap. Item 3 (stale map entries) is lower severity but leaves the tracking state inconsistent. A regression test would significantly strengthen confidence in the fix.

Automated review by Claude (Anthropic) · sonnet · custom prompt
Greptile Summary

This PR fixes a liveness bug in snap-sync storage downloading: accounts with large storage tries would get stuck because the per-interval re-queue path only fired on truncated peer responses, so a peer that covered the whole account in one reply left the intervals undrained forever. The fix partitions tasks at function entry — fresh accounts get bulk tasks, previously-split accounts get one task per recorded interval — and adds a completion handler that removes the finished interval and finalises the account group when all intervals are gone.
Confidence Score: 3/5

The fix resolves the described bug but introduces a latent regression path for accounts sharing a storage root due to non-deterministic HashMap iteration affecting interval lookup. A P1 finding (non-deterministic first_account selection) can reproduce the original liveness bug for a specific account topology, capping confidence at 4; the additional P2 (end-hash-only matching) lowers it slightly further to 3.

crates/networking/p2p/snap/client.rs — partitioning loop (line ~571) and interval-removal handler (line ~944)
| Filename | Overview |
|---|---|
| crates/networking/p2p/snap/client.rs | Fixes stuck storage-range download for big accounts by partitioning tasks into bulk and per-interval paths at function entry; introduces an interval-removal handler for fully-covered per-interval tasks, but the partitioning relies on first_account (non-deterministic HashMap iteration order) to detect existing intervals, which can silently fall back to the buggy bulk path when multiple accounts share a storage root. |
Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[request_storage_ranges called] --> B[Build accounts_by_root_hash from HashMap]
B --> C{For each group: check first_account.intervals}
C -- empty intervals --> D[Schedule bulk StorageTask start_hash=zero]
C -- non-empty intervals --> E[Schedule per-interval StorageTasks]
D --> F[Task dispatched to peer]
E --> F
F --> G{peer response: remaining_start < remaining_end?}
G -- yes, partial --> H{hash_start.is_zero?}
H -- yes bulk partial --> I[Re-queue remaining bulk chunk]
H -- no, per-interval partial --> J[Update interval start_hash, re-queue]
G -- no, fully covered --> K{hash_end.is_some?}
K -- no, bulk complete --> L[Mark accounts done]
K -- yes, per-interval complete NEW PATH --> M[Find acc with non-empty intervals in group]
M --> N[Remove matching interval by end_hash]
N --> O{intervals empty?}
O -- yes --> P[Mark all group accounts done + healed]
O -- no --> Q[More intervals remain, await other tasks]
G -- no, fully covered & hash_start!=0 & no hash_end --> R[Big account detected: split into chunks, store intervals]
```
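To make the per-interval completion path in the flowchart concrete, here is a minimal Rust sketch. The types are hypothetical simplified stand-ins (a bare `HashMap` for the tracking state, `String` errors instead of `SnapError`, `[u8; 32]` hashes), not the real definitions in crates/networking/p2p/snap/client.rs, and the interval is matched on both start and end hash as the review suggestion below recommends.

```rust
use std::collections::HashMap;

type H256 = [u8; 32];
/// Pending (start_hash, end_hash) download intervals for one account.
type Intervals = Vec<(H256, H256)>;

/// Completion path for a per-interval task that a peer fully covered:
/// remove the finished interval and report whether the whole account is done.
fn complete_interval(
    tracking: &mut HashMap<H256, (H256, Intervals)>, // account -> (storage_root, intervals)
    account: H256,
    hash_start: H256,
    hash_end: H256,
) -> Result<bool, String> {
    let (_root, intervals) = tracking
        .get_mut(&account)
        .ok_or_else(|| "account is not being tracked".to_owned())?;

    // Match on both bounds so two intervals sharing an end hash stay unambiguous.
    let pos = intervals
        .iter()
        .position(|(start, end)| *start == hash_start && *end == hash_end)
        .ok_or_else(|| "could not find the interval we were tracking".to_owned())?;
    intervals.remove(pos);

    // No intervals left: the account's storage is fully downloaded and its
    // tracking entry can be dropped (the real handler also marks the group of
    // accounts sharing this storage root as done and healed).
    if intervals.is_empty() {
        tracking.remove(&account);
        return Ok(true);
    }
    Ok(false)
}
```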
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
crates/networking/p2p/snap/client.rs:571-583
**Interval lookup keyed on non-deterministic `first_account`**
The partitioning loop decides bulk vs. per-interval by looking up `first_account.intervals`, where `first_account` is `accounts.first()` from a `Vec` built by iterating `account_storage_roots.accounts_with_storage_root` — a `HashMap` whose iteration order is non-deterministic across calls.
The split path stores intervals under exactly one key, `first_acc_hash = accounts_by_root_hash[remaining_start].1.first()`, which was determined in a *previous* call's `HashMap` iteration order. In a subsequent call, `accounts_by_root_hash[i].1.first()` can land on a *different* account whose intervals are empty, causing the group to fall through to the bulk path and reproduce the original stuck behavior for accounts that share a storage root.
### Issue 2 of 2
crates/networking/p2p/snap/client.rs:944-946
**Interval matched by end-hash only**
`position(|(_old_start, end)| end == &hash_end)` ignores `start_hash` when locating the completed interval. If two intervals in the same account happen to share the same `end_hash` (e.g. arithmetic overflow causes two adjacent chunks to land on the same ceiling value), this will remove the wrong interval and leave the other one dangling. Matching on both `(start_hash, end_hash)` would make the lookup unambiguous.
```suggestion
let pos = old_intervals
    .iter()
    .position(|(old_start, end)| old_start == &hash_start && end == &hash_end)
    .ok_or(SnapError::InternalError(
        "Could not find an old interval that we were tracking".to_owned(),
    ))?;
```
Reviews (1): Last reviewed commit: "snap: address review nits in scheduling ..."
```rust
for (i, (_, accounts)) in accounts_by_root_hash.iter().enumerate() {
    let first_account = *accounts.first().ok_or_else(|| {
        SnapError::InternalError("Empty accounts vector while scheduling tasks".to_owned())
    })?;
    let intervals = &account_storage_roots
        .accounts_with_storage_root
        .get(&first_account)
        .ok_or_else(|| {
            SnapError::InternalError(
                "Could not find intervals for account while scheduling".to_owned(),
            )
        })?
        .1;
```
**Interval lookup keyed on non-deterministic `first_account`**

The partitioning loop decides bulk vs. per-interval by looking up `first_account.intervals`, where `first_account` is `accounts.first()` from a `Vec` built by iterating `account_storage_roots.accounts_with_storage_root` — a `HashMap` whose iteration order is non-deterministic across calls.

The split path stores intervals under exactly one key, `first_acc_hash = accounts_by_root_hash[remaining_start].1.first()`, which was determined in a *previous* call's `HashMap` iteration order. In a subsequent call, `accounts_by_root_hash[i].1.first()` can land on a *different* account whose intervals are empty, causing the group to fall through to the bulk path and reproduce the original stuck behavior for accounts that share a storage root.
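A possible direction for addressing this, sketched with hypothetical simplified types rather than the PR's actual code: decide bulk vs. per-interval by scanning the whole shared-root group for any member account with pending intervals, so the decision no longer depends on which account happens to come first in `HashMap` iteration order.

```rust
use std::collections::HashMap;

type H256 = [u8; 32];
type Intervals = Vec<(H256, H256)>;

/// Hypothetical helper: return the first account in the group that still has
/// pending intervals recorded, along with those intervals. Scanning the whole
/// group (instead of only `accounts.first()`) makes the bulk-vs-interval
/// decision independent of map iteration order.
fn pending_intervals_for_group<'a>(
    group: &[H256],
    tracking: &'a HashMap<H256, (H256, Intervals)>, // account -> (storage_root, intervals)
) -> Option<(H256, &'a Intervals)> {
    group.iter().find_map(|account| {
        tracking
            .get(account)
            .filter(|(_root, intervals)| !intervals.is_empty())
            .map(|(_root, intervals)| (*account, intervals))
    })
}
```

A scheduler built on this would emit per-interval tasks when it returns `Some` and fall back to a single bulk task only when it returns `None`.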
```rust
let pos = old_intervals
    .iter()
    .position(|(_old_start, end)| end == &hash_end)
```
**Interval matched by end-hash only**

`position(|(_old_start, end)| end == &hash_end)` ignores `start_hash` when locating the completed interval. If two intervals in the same account happen to share the same `end_hash` (e.g. arithmetic overflow causes two adjacent chunks to land on the same ceiling value), this will remove the wrong interval and leave the other one dangling. Matching on both `(start_hash, end_hash)` would make the lookup unambiguous.

```suggestion
let pos = old_intervals
    .iter()
    .position(|(old_start, end)| old_start == &hash_start && end == &hash_end)
    .ok_or(SnapError::InternalError(
        "Could not find an old interval that we were tracking".to_owned(),
    ))?;
```
🤖 Codex Code Review

No blocking findings. The new task partitioning around client.rs:561 and the explicit “completed interval” cleanup in client.rs:918 fit the existing big-account bookkeeping, and I don’t see a correctness, security, or consensus-risk regression in the changed logic.

Residual risk: I don’t see targeted regression coverage for the exact case this fixes. A focused test that starts with persisted intervals and has a peer fully cover one of them in a single response would close that gap.

Automated review by OpenAI Codex · gpt-5.4 · custom prompt
Motivation
Storage range download would get stuck on large networks with small pivot times.
Description
We always queued bulk tasks from [0, MAX] regardless of an account's existing intervals, relying on the response handler's big-account split path to re-fire on each call. That path only fires when a peer's response is truncated. When a peer fully covered a known-big account in one response, no branch ran, and the account's intervals stayed in `accounts_with_storage_root` forever.
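As an illustration of the partitioning the fix introduces (bulk tasks for fresh accounts, one task per recorded interval for accounts that were already split), here is a minimal sketch. The names `StorageTask`, `AccountStorageRoots` and the `[u8; 32]` hash type are hypothetical simplified stand-ins for the real definitions in crates/networking/p2p/snap/client.rs.

```rust
use std::collections::HashMap;

type H256 = [u8; 32];

#[derive(Debug)]
struct StorageTask {
    account: H256,
    start_hash: H256,
    end_hash: Option<H256>, // None => bulk task over the whole key space
}

/// account -> (storage_root, pending download intervals)
struct AccountStorageRoots {
    accounts_with_storage_root: HashMap<H256, (H256, Vec<(H256, H256)>)>,
}

/// Partition scheduling at function entry: fresh accounts get a single bulk
/// task starting at hash zero; previously-split ("big") accounts get one task
/// per recorded interval, so they keep draining even when a peer later covers
/// an interval completely in one response.
fn schedule_storage_tasks(roots: &AccountStorageRoots) -> Vec<StorageTask> {
    let mut tasks = Vec::new();
    for (account, (_root, intervals)) in &roots.accounts_with_storage_root {
        if intervals.is_empty() {
            tasks.push(StorageTask {
                account: *account,
                start_hash: [0u8; 32],
                end_hash: None,
            });
        } else {
            for (start, end) in intervals {
                tasks.push(StorageTask {
                    account: *account,
                    start_hash: *start,
                    end_hash: Some(*end),
                });
            }
        }
    }
    tasks
}

fn main() {
    let mut map = HashMap::new();
    // A fresh account and a big account with two pending intervals.
    map.insert([1u8; 32], ([0u8; 32], vec![]));
    map.insert(
        [2u8; 32],
        ([0u8; 32], vec![([0u8; 32], [0x7f; 32]), ([0x80; 32], [0xff; 32])]),
    );
    let tasks = schedule_storage_tasks(&AccountStorageRoots {
        accounts_with_storage_root: map,
    });
    println!("scheduled {} tasks", tasks.len()); // 1 bulk + 2 per-interval = 3
}
```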