
perf(l1): parallelize within-trie build for large storage tries in snap sync insert_storages #6477


Motivation

During snap sync's insert_storages phase, the largest single account (the "monster") is built on one thread, and that build alone takes roughly four minutes of wall time. While the overall phase is already parallel across accounts (16 worker threads, one task per account), the within-trie build of any individual account is sequential. For the handful of very large contracts on mainnet (Uniswap-class), this single-threaded path measurably lengthens total sync time.

This issue tracks parallelizing the within-trie build for large storage tries in snap sync, analogous to what PR #6410 did for the single account trie via 16-nibble split.

Distinct from #5482: that issue is about block-execution merkelization (parallelizing per-tx storage updates for hot contracts). This issue is about snap sync's initial trie construction from downloaded storage slots. Similar 16-nibble idea, different code paths:

  • Parallelize merkelization of storage slots #5482 (block execution): compute_state_root_with_updates path, incremental updates after tx execution
  • This issue (snap sync): insert_storages path in snap_sync.rs, bulk insertion from downloaded StorageRanges responses

Current state

insert_storages dispatches work as follows:

Dispatcher (1 thread) ──► 16-slot worker pool
  ├── Task: build_storage_trie(account_A) [thread 1]
  ├── Task: build_storage_trie(account_B) [thread 2]
  ├── ...
  └── Task: build_storage_trie(monster)   [thread 9, runs 245s alone]

The per-account trie traversal inside trie_from_sorted_accounts_with_stats (trie_sorted.rs:189-237) runs on a single worker thread. Workers offload flush work to the shared pool via scope.execute_priority, so there's some parallelism at the flush layer — but the main traversal is sequential per account.
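
For orientation, here is a minimal sketch of that dispatch shape, with scoped threads standing in for the worker pool (names and types are hypothetical; the real dispatcher lives in snap_sync.rs, and the flush offloading is omitted):

```rust
use std::sync::mpsc;
use std::thread;

const WORKERS: usize = 16;

// One task per account; the dispatcher blocks on `recv` until a worker slot
// frees up. That blocked time is the `wait_time` metric quoted below.
fn dispatch_accounts(accounts: Vec<(u64, Vec<u8>)>) {
    let (slot_tx, slot_rx) = mpsc::channel::<()>();
    for _ in 0..WORKERS {
        slot_tx.send(()).unwrap(); // all 16 slots start out free
    }
    thread::scope(|s| {
        for (account, slots) in accounts {
            slot_rx.recv().unwrap(); // blocks when all 16 slots are busy
            let done = slot_tx.clone();
            s.spawn(move || {
                build_storage_trie(account, &slots); // sequential per-account build
                done.send(()).unwrap(); // return the slot
            });
        }
    });
}

fn build_storage_trie(_account: u64, _slots: &[u8]) { /* stand-in */ }
```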

What PR #6410 did (for comparison)

PR #6410 parallelized the account trie build (the single big trie of all 26M accounts) by splitting it across 16 first-nibble ranges via trie_from_sorted_parallel. That was a within-trie parallelization of the top-level state trie. Measured 30% improvement on insert_accounts.

This issue proposes the analogous change inside each large storage task: when the storage trie being built is big, split its construction across 16 storage-slot-nibble ranges.
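
Because the downloaded slots arrive sorted by hashed key, the 16 first-nibble ranges are contiguous slices, so the split itself is cheap. A minimal sketch of that partitioning (illustrative only; the real change would reuse trie_from_sorted_parallel, and the subtrie types here are stand-ins):

```rust
use std::thread;

type H256 = [u8; 32];

struct SubtrieRoot; // stand-in for the real node type
fn build_subtrie(_range: &[(H256, Vec<u8>)]) -> SubtrieRoot { SubtrieRoot }

// `slots` is sorted by hashed slot key, so each first-nibble range is a
// contiguous slice and can be built as an independent subtrie.
fn build_large_storage_trie(slots: &[(H256, Vec<u8>)]) -> Vec<SubtrieRoot> {
    // Find the 17 boundaries of the 16 first-nibble ranges by binary search.
    let mut bounds = [0usize; 17];
    for nibble in 0..16u8 {
        bounds[nibble as usize + 1] =
            slots.partition_point(|(key, _)| (key[0] >> 4) <= nibble);
    }
    thread::scope(|s| {
        let handles: Vec<_> = (0..16)
            .map(|n| {
                let range = &slots[bounds[n]..bounds[n + 1]];
                s.spawn(move || build_subtrie(range))
            })
            .collect();
        // The 16 subtrie roots then become the children of the top-level
        // branch node of the storage trie.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```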

Profiling baseline (mainnet run 20260412_172457)

insert_storages aggregate:

wall            = 2347.6s   (39m 07s)
total_trie_cpu  = 18965.1s  (sum across all tasks)
parallelism     = 8.1x      (of 16 threads → 50% utilization)
wait_time       = 1619.2s   (dispatcher blocked on recv, 69% of wall)
accounts        = 26,311,772
total_leaves    = 1,552,108,325
max_trie        = 244.9s    (monster)

Per-account distribution: the "large" tail

Only 30 of the 26.3M accounts exceed 5s of wall time:

Account prefix           Leaves        Wall         CPU     io_wait (buffer wait)  Flushes
159e489… (the monster)   159,600,000   244.9s       242.5s  2.5s                   10,634
ab14d68…                 23,700,000    34.5s        34.5s   0.0s                   1,636
1ff0800…                 17,700,000    32.7s        31.2s   1.5s                   1,210
(4 more)                 13.5–16.3M    17–23s each  ≈wall   0.0s                   -
(~25 more)               3–10M         5–15s each   ≈wall   0.0s                   -
(rest)                   avg <100K     avg <1ms     -       -                      -

Monster characteristics:

  • 99% CPU-bound (io_wait/buffer-wait = 2.5s of 245s)
  • 10,634 flushes over the 245s run (flush pool keeps up)
  • Most likely Uniswap v3 or similar high-activity contract based on leaf count

Instrumentation source: TrieInsertStats on branch perf/snap-sync-profiling (not on main; based on #6470).

Realistic savings estimate

Theoretical upper bound (16× split, perfect scaling)

Monster goes from 244.9s → ~15s → saves ~230s.

Realistic (accounting for contention and uneven nibble distribution)

~100–150s saved, which is ~5–6% of the storage phase wall time (39m → ~36–37m).

Reasons it's less than the theoretical upper bound:

  1. Uneven nibble distribution. Real storage slot hashes aren't uniformly distributed across 16 buckets for any given contract. Some nibbles will have 2–3× more leaves than others, capping speedup.
  2. Shared buffer/flush pool contention. The 32-buffer pool is sized for the current parallelism. 16 extra concurrent subtasks all competing for buffers and flushes could cause buffer exhaustion (which currently is not a bottleneck — see the 2.5s io_wait on the monster — but that's with only one traversal running).
  3. Other threads weren't idle during the monster anyway. When the monster runs on one thread, the other 15 threads are processing the long tail of smaller accounts. Their wall time during the monster's 245s is not "free" — they're consuming ~229s of work. So the monster's single-threaded run doesn't cost us a full 15×245 = 3,675 thread-seconds of idle time.

Thread-utilization perspective (the 80% finding)

Total idle thread-seconds in the storage phase:

  • Available: 16 threads × 2347s = 37,554 thread-seconds
  • Used (total_trie_cpu): 18,965 thread-seconds
  • Idle: 18,589 thread-seconds (49%)

If the monster ran solo with 15 threads fully idle for 244.9s, that would account for 244.9 × 15 = 3,674 idle thread-seconds — only ~20% of the total idle time.

The other ~80% comes from dispatcher overhead on the small-account tail (26.3M accounts × avg <1ms, dispatcher blocked 69% of wall on slot turnover). That's tracked separately in #6476 (small-account batching), which is the higher-value opportunity on current profiling data.

The two fixes are additive: #6476 attacks dispatcher idle time on the small-account tail, while this issue attacks the monster's single-threaded wall time.

Proposed implementation

Apply trie_from_sorted_parallel (from trie_sorted.rs, used in #6410) inside the per-account storage task when the account's leaf count exceeds a threshold:

// Sketch of the insert_storages worker path (signatures illustrative; the
// real helpers live in trie_sorted.rs)
fn build_storage_trie(account_hash: H256, slots: Vec<(H256, U256)>) {
    // Slots arrive sorted by hashed key, so the leaf count is known up front.
    if slots.len() > LARGE_STORAGE_THRESHOLD {
        // Split across 16 storage-slot-nibble ranges, parallel build
        trie_from_sorted_parallel(slots, &thread_pool, /* ... */);
    } else {
        // Current single-threaded path
        trie_from_sorted_accounts_with_stats(slots);
    }
}

Key design choices

  1. Threshold for "large." The distribution is highly bimodal (median <1ms, tail 5–245s), so the exact threshold isn't critical. Suggest ~1M leaves or ~3s of historical build time. Below that, parallelization overhead likely exceeds the gain.

  2. Pool sharing strategy. trie_from_sorted_parallel uses a thread pool for its 16 subtasks. The outer insert_storages dispatcher also uses a pool. Options:

    • Shared pool (what PR #6410 does for insert_accounts): one 16-thread pool, subtasks and outer tasks contend for slots. Risk: monster subtasks starve the long tail.
    • Nested pool: outer 16-thread pool + inner N-thread pool per large task. Cleaner isolation but more threads total (sketched after this list).
    • Adaptive: monster subtasks get pool priority (execute_priority) like flushes do today.
  3. Buffer pool sizing. Currently 32 buffers. With 16 concurrent subtasks on the monster AND 15 other tasks on small accounts, that's 31 potentially concurrent users. May need to bump the pool or confirm buffer contention doesn't regress. Measure io_wait (buffer wait) on the monster before/after.
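
A minimal sketch of the nested-pool option using rayon (pool sizes and the threshold are illustrative assumptions, not measured choices):

```rust
use rayon::ThreadPoolBuilder;

// Hypothetical threshold, per design choice 1 (~1M leaves).
const LARGE_STORAGE_THRESHOLD: usize = 1_000_000;

fn build_storage_trie(slots: Vec<(Vec<u8>, Vec<u8>)>, inner_pool: &rayon::ThreadPool) {
    if slots.len() > LARGE_STORAGE_THRESHOLD {
        // Large task: run the 16 nibble subtasks on the dedicated inner pool
        // so they cannot occupy the outer pool's 16 account slots.
        inner_pool.install(|| {
            // trie_from_sorted_parallel(...) would run here.
        });
    } else {
        // Small task: current single-threaded path on the outer worker.
    }
}

fn main() {
    // One inner pool shared by all large tasks keeps the total thread count
    // bounded, unlike spawning a fresh pool per large account.
    let inner = ThreadPoolBuilder::new().num_threads(8).build().unwrap();
    build_storage_trie(Vec::new(), &inner);
}
```

This assumes the inner build uses rayon primitives, so nested parallel calls land on the installed pool; if trie_from_sorted_parallel manages threads itself, the pool handle would be passed explicitly instead.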

Validation plan

  1. Baseline. Reproduce wall, parallelism, wait_time, and the monster's per-account stats on a mainnet sync with the current code.
  2. Implement the nibble split for trie_from_sorted_parallel inside storage tasks, with LARGE_STORAGE_THRESHOLD chosen to catch the top ~30 accounts.
  3. Compare per-account wall/cpu/io_wait for the monster and other large accounts before/after.
  4. Verify no regression on the long tail. Measure wait_time (dispatcher blocked time) — if it increases, buffer/slot contention has regressed the small-account path.
  5. Correctness. Run the full sync with release-with-debug-assertions profile and confirm storage trie roots match expected values at pivot.
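
For step 5, a cheap extra guard is a differential check: feed the same sorted slots through both paths and require identical roots. A minimal sketch, with hypothetical function shapes standing in for the real trie_sorted.rs signatures:

```rust
type H256 = [u8; 32];

/// Differential guard: both build paths must agree on the storage root.
/// `build_sequential` / `build_parallel` are hypothetical stand-ins for
/// trie_from_sorted_accounts_with_stats and the new parallel path.
fn assert_roots_match(
    slots: &[(H256, Vec<u8>)],
    build_sequential: impl Fn(&[(H256, Vec<u8>)]) -> H256,
    build_parallel: impl Fn(&[(H256, Vec<u8>)]) -> H256,
) {
    // Fires under the release-with-debug-assertions profile mentioned above.
    debug_assert_eq!(
        build_sequential(slots),
        build_parallel(slots),
        "parallel storage trie root diverged from sequential build"
    );
}
```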

Dependencies and ordering

Instrumentation gaps worth closing first

The profiling that motivates this issue has limitations. Before investing in the optimization, consider:

  • No per-flush timing. flush_nodes_to_write is where the real disk writes happen, and it currently has no instrumentation. Adding flush_time to TrieInsertStats (sketched after this list) would let us confirm whether we're I/O-bound or CPU-bound on the monster (current hypothesis: CPU-bound, 99% based on io_wait, but io_wait is actually buffer wait, not disk I/O).
  • io_wait is misnamed — currently means "time blocked on buffer pool recv", not disk I/O. Rename to buffer_wait.
  • No OS-level CPU util samples — capture via pidstat -p <pid> -u 1 in parallel during runs for ground-truth CPU utilization. Current parallelism=8.1x slightly overstates OS CPU usage for short tasks.
  • No insert_accounts aggregate line (the phase PR #6410 parallelized internally) — useful for comparing phase utilization post-#6410.
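
A sketch of the extended stats record (the field set is an assumption; the real TrieInsertStats lives on the perf/snap-sync-profiling branch):

```rust
use std::time::{Duration, Instant};

#[derive(Default)]
struct TrieInsertStats {
    leaves: u64,
    wall: Duration,
    cpu: Duration,
    buffer_wait: Duration, // renamed from the misleading `io_wait`
    flush_time: Duration,  // new: time spent inside flush_nodes_to_write
    flushes: u64,
}

impl TrieInsertStats {
    /// Time one flush and fold it into the stats,
    /// e.g. `stats.record_flush(|| flush_nodes_to_write(&mut buf))`.
    fn record_flush(&mut self, flush: impl FnOnce()) {
        let start = Instant::now();
        flush();
        self.flush_time += start.elapsed();
        self.flushes += 1;
    }
}
```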

Loosely tracked in the #6470 backlog.

Experimental commits (starting point)

Three commits on branch snapsync-roadmap (the account-trie change that shipped in #6410) are the working starting point:

  • b82ba0e1b perf(l1): parallelize state trie building across 16 nibble ranges
  • ac0709f40 fix(l1): use streaming RocksDB iterators for parallel trie building
  • 8dfa7f2e3 fix(l1): address review — flush code hashes incrementally, document thread count

These apply to insert_accounts (the single big trie); the analogous insert_storages change would apply trie_from_sorted_parallel inside each large storage task.

