
perf(l1): parallelize within-trie build for large storage tries in snap sync insert_storages #6477


Motivation

During snap sync's insert_storages phase, the largest single account (the "monster") is built on one thread, and that build alone takes roughly four minutes of wall time. While the overall phase is already parallel across accounts (16 worker threads, one task per account), the within-trie build of any individual account is sequential. For the handful of very large contracts on mainnet (Uniswap-class), this single-threaded path measurably lengthens total sync time.

This issue tracks parallelizing the within-trie build for large storage tries in snap sync, analogous to what PR #6410 did for the single account trie via 16-nibble split.

Distinct from #5482: that issue is about block-execution merkelization (parallelizing per-tx storage updates for hot contracts). This issue is about snap sync's initial trie construction from downloaded storage slots. Similar 16-nibble idea, different code paths:

  • Parallelize merkelization of storage slots #5482 (block execution): compute_state_root_with_updates path, incremental updates after tx execution
  • This issue (snap sync): insert_storages path in snap_sync.rs, bulk insertion from downloaded StorageRanges responses

Current state

insert_storages dispatches work as follows:

Dispatcher (1 thread) ──► 16-slot worker pool
  ├── Task: build_storage_trie(account_A) [thread 1]
  ├── Task: build_storage_trie(account_B) [thread 2]
  ├── ...
  └── Task: build_storage_trie(monster)   [thread 9, runs 245s alone]

The per-account trie traversal inside trie_from_sorted_accounts_with_stats (trie_sorted.rs:189-237) runs on a single worker thread. Workers offload flush work to the shared pool via scope.execute_priority, so there's some parallelism at the flush layer — but the main traversal is sequential per account.
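
For orientation, here is a minimal sketch of that dispatch shape, with scoped threads standing in for the worker pool (names and types are hypothetical; the real dispatcher lives in snap_sync.rs, and the flush offloading is omitted):

```rust
use std::sync::mpsc;
use std::thread;

const WORKERS: usize = 16;

// One task per account; the dispatcher blocks on `recv` until a worker slot
// frees up. That blocked time is the `wait_time` metric quoted below.
fn dispatch_accounts(accounts: Vec<(u64, Vec<u8>)>) {
    let (slot_tx, slot_rx) = mpsc::channel::<()>();
    for _ in 0..WORKERS {
        slot_tx.send(()).unwrap(); // all 16 slots start out free
    }
    thread::scope(|s| {
        for (account, slots) in accounts {
            slot_rx.recv().unwrap(); // blocks when all 16 slots are busy
            let done = slot_tx.clone();
            s.spawn(move || {
                build_storage_trie(account, &slots); // sequential per-account build
                done.send(()).unwrap(); // return the slot
            });
        }
    });
}

fn build_storage_trie(_account: u64, _slots: &[u8]) { /* stand-in */ }
```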

What PR #6410 did (for comparison)

PR #6410 parallelized the account trie build (the single big trie of all 26M accounts) by splitting it across 16 first-nibble ranges via trie_from_sorted_parallel. That was a within-trie parallelization of the top-level state trie. Measured 30% improvement on insert_accounts.

This issue proposes the analogous change inside each large storage task: when the storage trie being built is big, split its construction across 16 storage-slot-nibble ranges.
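
Because the downloaded slots arrive sorted by hashed key, the 16 first-nibble ranges are contiguous slices, so the split itself is cheap. A minimal sketch of that partitioning (illustrative only; the real change would reuse trie_from_sorted_parallel, and the subtrie types here are stand-ins):

```rust
use std::thread;

type H256 = [u8; 32];

struct SubtrieRoot; // stand-in for the real node type
fn build_subtrie(_range: &[(H256, Vec<u8>)]) -> SubtrieRoot { SubtrieRoot }

// `slots` is sorted by hashed slot key, so each first-nibble range is a
// contiguous slice and can be built as an independent subtrie.
fn build_large_storage_trie(slots: &[(H256, Vec<u8>)]) -> Vec<SubtrieRoot> {
    // Find the 17 boundaries of the 16 first-nibble ranges by binary search.
    let mut bounds = [0usize; 17];
    for nibble in 0..16u8 {
        bounds[nibble as usize + 1] =
            slots.partition_point(|(key, _)| (key[0] >> 4) <= nibble);
    }
    thread::scope(|s| {
        let handles: Vec<_> = (0..16)
            .map(|n| {
                let range = &slots[bounds[n]..bounds[n + 1]];
                s.spawn(move || build_subtrie(range))
            })
            .collect();
        // The 16 subtrie roots then become the children of the top-level
        // branch node of the storage trie.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```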

Profiling baseline (mainnet run 20260412_172457)

insert_storages aggregate:

wall            = 2347.6s   (39m 07s)
total_trie_cpu  = 18965.1s  (sum across all tasks)
parallelism     = 8.1x      (of 16 threads → 50% utilization)
wait_time       = 1619.2s   (dispatcher blocked on recv, 69% of wall)
accounts        = 26,311,772
total_leaves    = 1,552,108,325
max_trie        = 244.9s    (monster)

Per-account distribution: the "large" tail

Only 30 of the 26.3M accounts exceed 5s of wall time:

Account prefix           Leaves        Wall         CPU     io_wait (buffer wait)  Flushes
159e489… (the monster)   159,600,000   244.9s       242.5s  2.5s                   10,634
ab14d68…                 23,700,000    34.5s        34.5s   0.0s                   1,636
1ff0800…                 17,700,000    32.7s        31.2s   1.5s                   1,210
(4 more)                 13.5–16.3M    17–23s each  ≈wall   0.0s                   -
(~25 more)               3–10M         5–15s each   ≈wall   0.0s                   -
(rest)                   avg <100K     avg <1ms     -       -                      -

Monster characteristics:

  • 99% CPU-bound (io_wait/buffer-wait = 2.5s of 245s)
  • 10,634 flushes over the 245s run (flush pool keeps up)
  • Most likely Uniswap v3 or similar high-activity contract based on leaf count

Instrumentation source: TrieInsertStats on branch perf/snap-sync-profiling (not on main; based on #6470).

Realistic savings estimate

Theoretical upper bound (16× split, perfect scaling)

Monster goes from 244.9s → ~15s → saves ~230s.

Realistic (accounting for contention and uneven nibble distribution)

~100–150s saved, which is ~5–6% of the storage phase wall time (39m → ~36–37m).

Reasons it's less than the theoretical upper bound:

  1. Uneven nibble distribution. Real storage slot hashes aren't uniformly distributed across 16 buckets for any given contract. Some nibbles will have 2–3× more leaves than others, capping speedup.
  2. Shared buffer/flush pool contention. The 32-buffer pool is sized for the current parallelism. 16 extra concurrent subtasks all competing for buffers and flushes could cause buffer exhaustion (which currently is not a bottleneck — see the 2.5s io_wait on the monster — but that's with only one traversal running).
  3. Other threads weren't idle during the monster anyway. When the monster runs on one thread, the other 15 threads are processing the long tail of smaller accounts. Their wall time during the monster's 245s is not "free" — they're consuming ~229s of work. So the monster's single-threaded run doesn't cost us a full 15×245 = 3,675 thread-seconds of idle time.

Thread-utilization perspective (the 80% finding)

Total idle thread-seconds in the storage phase:

  • Available: 16 threads × 2347s = 37,554 thread-seconds
  • Used (total_trie_cpu): 18,965 thread-seconds
  • Idle: 18,589 thread-seconds (49%)

If the monster ran solo with 15 threads fully idle for 244.9s, that would account for 244.9 × 15 = 3,674 idle thread-seconds — only ~20% of the total idle time.

The other ~80% comes from dispatcher overhead on the small-account tail (26.3M accounts × avg <1ms, dispatcher blocked 69% of wall on slot turnover). That's tracked separately in #6476 (small-account batching), which is the higher-value opportunity on current profiling data.

The two fixes are additive: #6476 attacks dispatcher idle time on the small-account tail, while this issue attacks the monster's single-threaded wall time.

Proposed implementation

Apply trie_from_sorted_parallel (from trie_sorted.rs, used in #6410) inside the per-account storage task when the account's leaf count exceeds a threshold:

// Sketch of the insert_storages worker path (signatures illustrative; the
// real helpers live in trie_sorted.rs)
fn build_storage_trie(account_hash: H256, slots: Vec<(H256, U256)>) {
    // Slots arrive sorted by hashed key, so the leaf count is known up front.
    if slots.len() > LARGE_STORAGE_THRESHOLD {
        // Split across 16 storage-slot-nibble ranges, parallel build
        trie_from_sorted_parallel(slots, &thread_pool, /* ... */);
    } else {
        // Current single-threaded path
        trie_from_sorted_accounts_with_stats(slots);
    }
}

Key design choices

  1. Threshold for "large." The distribution is highly bimodal (median <1ms, tail 5–245s), so the exact threshold isn't critical. Suggest ~1M leaves or ~3s of historical build time. Below that, parallelization overhead likely exceeds the gain.

  2. Pool sharing strategy. trie_from_sorted_parallel uses a thread pool for its 16 subtasks. The outer insert_storages dispatcher also uses a pool. Options:

    • Shared pool (what PR #6410 does for insert_accounts): one 16-thread pool, subtasks and outer tasks contend for slots. Risk: monster subtasks starve the long tail.
    • Nested pool: outer 16-thread pool + inner N-thread pool per large task. Cleaner isolation but more threads total (sketched after this list).
    • Adaptive: monster subtasks get pool priority (execute_priority) like flushes do today.
  3. Buffer pool sizing. Currently 32 buffers. With 16 concurrent subtasks on the monster AND 15 other tasks on small accounts, that's 31 potentially concurrent users. May need to bump the pool or confirm buffer contention doesn't regress. Measure io_wait (buffer wait) on the monster before/after.
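
A minimal sketch of the nested-pool option using rayon (pool sizes and the threshold are illustrative assumptions, not measured choices):

```rust
use rayon::ThreadPoolBuilder;

// Hypothetical threshold, per design choice 1 (~1M leaves).
const LARGE_STORAGE_THRESHOLD: usize = 1_000_000;

fn build_storage_trie(slots: Vec<(Vec<u8>, Vec<u8>)>, inner_pool: &rayon::ThreadPool) {
    if slots.len() > LARGE_STORAGE_THRESHOLD {
        // Large task: run the 16 nibble subtasks on the dedicated inner pool
        // so they cannot occupy the outer pool's 16 account slots.
        inner_pool.install(|| {
            // trie_from_sorted_parallel(...) would run here.
        });
    } else {
        // Small task: current single-threaded path on the outer worker.
    }
}

fn main() {
    // One inner pool shared by all large tasks keeps the total thread count
    // bounded, unlike spawning a fresh pool per large account.
    let inner = ThreadPoolBuilder::new().num_threads(8).build().unwrap();
    build_storage_trie(Vec::new(), &inner);
}
```

This assumes the inner build uses rayon primitives, so nested parallel calls land on the installed pool; if trie_from_sorted_parallel manages threads itself, the pool handle would be passed explicitly instead.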

Validation plan

  1. Baseline. Reproduce wall, parallelism, wait_time, and the monster's per-account stats on a mainnet sync with the current code.
  2. Implement the nibble split for trie_from_sorted_parallel inside storage tasks, with LARGE_STORAGE_THRESHOLD chosen to catch the top ~30 accounts.
  3. Compare per-account wall/cpu/io_wait for the monster and other large accounts before/after.
  4. Verify no regression on the long tail. Measure wait_time (dispatcher blocked time) — if it increases, buffer/slot contention has regressed the small-account path.
  5. Correctness. Run the full sync with release-with-debug-assertions profile and confirm storage trie roots match expected values at pivot.
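
For step 5, a cheap extra guard is a differential check: feed the same sorted slots through both paths and require identical roots. A minimal sketch, with hypothetical function shapes standing in for the real trie_sorted.rs signatures:

```rust
type H256 = [u8; 32];

/// Differential guard: both build paths must agree on the storage root.
/// `build_sequential` / `build_parallel` are hypothetical stand-ins for
/// trie_from_sorted_accounts_with_stats and the new parallel path.
fn assert_roots_match(
    slots: &[(H256, Vec<u8>)],
    build_sequential: impl Fn(&[(H256, Vec<u8>)]) -> H256,
    build_parallel: impl Fn(&[(H256, Vec<u8>)]) -> H256,
) {
    // Fires under the release-with-debug-assertions profile mentioned above.
    debug_assert_eq!(
        build_sequential(slots),
        build_parallel(slots),
        "parallel storage trie root diverged from sequential build"
    );
}
```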

Dependencies and ordering

Instrumentation gaps worth closing first

The profiling that motivates this issue has limitations. Before investing in the optimization, consider:

  • No per-flush timing. flush_nodes_to_write is where the real disk writes happen, and it currently has no instrumentation. Adding flush_time to TrieInsertStats (sketched after this list) would let us confirm whether we're I/O-bound or CPU-bound on the monster (current hypothesis: CPU-bound, 99% based on io_wait, but io_wait is actually buffer wait, not disk I/O).
  • io_wait is misnamed — currently means "time blocked on buffer pool recv", not disk I/O. Rename to buffer_wait.
  • No OS-level CPU util samples — capture via pidstat -p <pid> -u 1 in parallel during runs for ground-truth CPU utilization. Current parallelism=8.1x slightly overstates OS CPU usage for short tasks.
  • No insert_accounts aggregate line (the phase PR #6410 parallelized internally) — useful for comparing phase utilization post-#6410.
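
A sketch of the extended stats record (the field set is an assumption; the real TrieInsertStats lives on the perf/snap-sync-profiling branch):

```rust
use std::time::{Duration, Instant};

#[derive(Default)]
struct TrieInsertStats {
    leaves: u64,
    wall: Duration,
    cpu: Duration,
    buffer_wait: Duration, // renamed from the misleading `io_wait`
    flush_time: Duration,  // new: time spent inside flush_nodes_to_write
    flushes: u64,
}

impl TrieInsertStats {
    /// Time one flush and fold it into the stats,
    /// e.g. `stats.record_flush(|| flush_nodes_to_write(&mut buf))`.
    fn record_flush(&mut self, flush: impl FnOnce()) {
        let start = Instant::now();
        flush();
        self.flush_time += start.elapsed();
        self.flushes += 1;
    }
}
```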

Loosely tracked in the #6470 backlog.

Experimental commits (starting point)

Three commits on branch snapsync-roadmap (the account-trie change that shipped in #6410) are the working starting point:

  • b82ba0e1b perf(l1): parallelize state trie building across 16 nibble ranges
  • ac0709f40 fix(l1): use streaming RocksDB iterators for parallel trie building
  • 8dfa7f2e3 fix(l1): address review — flush code hashes incrementally, document thread count

These apply to insert_accounts (the single big trie); the analogous insert_storages change would apply trie_from_sorted_parallel inside each large storage task.

