Motivation
During snap sync's insert_storages phase, a single large account (the "monster") runs single-threaded and dominates its own trie-build wall time. While the overall phase is already parallel across many accounts (16 worker threads, one task per account), the within-trie build of any individual account is sequential. For the handful of very large contracts on mainnet (Uniswap-class), this single-threaded path measurably impacts total sync time.

This issue tracks parallelizing the within-trie build for large storage tries in snap sync, analogous to what PR #6410 did for the single account trie via a 16-nibble split.
Distinct from #5482: that issue is about block-execution merkleization (parallelizing per-tx storage updates for hot contracts). This issue is about snap sync's initial trie construction from downloaded storage slots. Similar 16-nibble idea, different code paths:
#5482: the compute_state_root_with_updates path, incremental updates after tx execution.
This issue: the insert_storages path in snap_sync.rs, bulk insertion from downloaded StorageRanges responses.

Current state
insert_storages dispatches one task per account across the 16 worker threads. The per-account trie traversal inside trie_from_sorted_accounts_with_stats (trie_sorted.rs:189-237) runs on a single worker thread. Workers offload flush work to the shared pool via scope.execute_priority, so there's some parallelism at the flush layer — but the main traversal is sequential per account.

What PR #6410 did (for comparison)
PR #6410 parallelized the account trie build (the single big trie of all 26M accounts) by splitting it across 16 first-nibble ranges via trie_from_sorted_parallel. That was a within-trie parallelization of the top-level state trie. Measured 30% improvement on insert_accounts.

This issue proposes the analogous change inside each large storage task: when the storage trie being built is big, split its construction across 16 storage-slot-nibble ranges.
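For intuition, a minimal self-contained sketch of the first-nibble split. This is not the #6410 code: build_subtrie stands in for the sequential builder, Hash/Leaf are simplified types, and the real path reuses trie_from_sorted_parallel and the shared thread pool rather than raw std threads.

```rust
use std::thread;

type Hash = [u8; 32];
type Leaf = (Hash, Vec<u8>); // (hashed storage slot, RLP-encoded value)

fn first_nibble(h: &Hash) -> usize {
    (h[0] >> 4) as usize
}

/// Stand-in for the sequential per-range builder; returns the subtrie's root hash.
fn build_subtrie(leaves: &[Leaf]) -> Hash {
    // ...sequential trie insertion would go here...
    let _ = leaves;
    [0u8; 32]
}

/// Leaves are sorted by hash, so each first-nibble range is a contiguous slice.
fn build_trie_parallel(sorted_leaves: &[Leaf]) -> [Hash; 16] {
    let mut ranges: Vec<&[Leaf]> = Vec::with_capacity(16);
    let mut start = 0;
    for nibble in 0..16 {
        let len = sorted_leaves[start..]
            .iter()
            .take_while(|(h, _)| first_nibble(h) == nibble)
            .count();
        ranges.push(&sorted_leaves[start..start + len]);
        start += len;
    }

    // One scoped thread per nibble range; the 16 subtries build independently.
    let mut roots = [[0u8; 32]; 16];
    thread::scope(|s| {
        let handles: Vec<_> = ranges
            .iter()
            .map(|&range| s.spawn(move || build_subtrie(range)))
            .collect();
        for (i, handle) in handles.into_iter().enumerate() {
            roots[i] = handle.join().unwrap();
        }
    });
    // The caller combines the 16 roots under a single branch node to get the storage root.
    roots
}
```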
Profiling baseline (mainnet run 20260412_172457)

insert_storages aggregate: wall ≈ 39m, measured parallelism ≈ 8.1×, total_trie_cpu ≈ 18,965 thread-seconds, dispatcher blocked ~69% of wall on slot turnover.

Per-account distribution: the "large" tail

30 accounts out of 26.3M qualify as >5s wall time, including 159e489… (the monster), ab14d68…, and 1ff0800… (per-account columns: wall, cpu, io_wait = buffer wait).

Monster characteristics: ~244.9s wall, essentially all CPU (io_wait/buffer-wait = 2.5s of 245s).

Instrumentation source: TrieInsertStats on branch perf/snap-sync-profiling (not on main; based on #6470).

Realistic savings estimate
Theoretical upper bound (16× split, perfect scaling)
Monster goes from 244.9s → ~15s → saves ~230s.
Realistic (accounting for contention and uneven nibble distribution)
~100–150s saved, which is ~5–6% of the storage phase wall time (39m → ~36–37m).
Reasons it's less than the theoretical upper bound:
Uneven nibble distribution. Real storage slot hashes aren't uniformly distributed across 16 buckets for any given contract. Some nibbles will have 2–3× more leaves than others, capping speedup.

Shared buffer/flush pool contention. The 32-buffer pool is sized for the current parallelism. 16 extra concurrent subtasks all competing for buffers and flushes could cause buffer exhaustion (which currently is not a bottleneck — see the 2.5s io_wait on the monster — but that's with only one traversal running).

Other threads weren't idle during the monster anyway. When the monster runs on one thread, the other 15 threads are processing the long tail of smaller accounts. Their wall time during the monster's 245s is not "free" — they're consuming ~229s of work. So the monster's single-threaded run doesn't cost us a full 15 × 245 = 3,675 thread-seconds of idle time.

Thread-utilization perspective (the 80% finding)

Total idle thread-seconds in the storage phase: wall (~39 min ≈ 2,340s) × 16 threads ≈ 37,440 thread-seconds; busy time (total_trie_cpu) is 18,965 thread-seconds, leaving roughly 18,500 idle thread-seconds.

If the monster ran solo with 15 threads fully idle for 244.9s, that would account for 244.9 × 15 = 3,674 idle thread-seconds — only ~20% of the total idle time.
The other ~80% comes from dispatcher overhead on the small-account tail (26.3M accounts × avg <1ms, dispatcher blocked 69% of wall on slot turnover). That's tracked separately in #6476 (small-account batching) — higher-value opportunity than this issue on current profiling data.
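Back-of-envelope, using the figures above (the ~37,440 and ~18,475 are derived from the stated 39m wall and 16 threads):

```
wall ≈ 39 min ≈ 2,340 s  →  2,340 s × 16 threads ≈ 37,440 thread-seconds available
busy (total_trie_cpu)                            ≈ 18,965 thread-seconds
idle                                             ≈ 18,475 thread-seconds
monster-attributable idle ≤ 244.9 s × 15         ≈  3,674 thread-seconds (~20% of idle)
remainder (dispatcher overhead on the tail)      ≈ 14,800 thread-seconds (~80% of idle)
```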
Both fixes are additive:

This issue: reclaims a fraction of the ~20% (the monster's idle time).
#6476: targets the remaining ~80% (dispatcher overhead on the small-account tail).
Proposed implementation
Apply trie_from_sorted_parallel (from trie_sorted.rs, used in #6410) inside the per-account storage task when the account's leaf count exceeds a threshold:
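A rough sketch of the dispatch (build_storage_trie and LARGE_STORAGE_THRESHOLD are placeholder names; the elided pieces are intentionally left open):

```rust
// Sketch in the insert_storages worker path (placeholder names, not a compiling API)
fn build_storage_trie(account_hash: H256, slots: impl Iterator<Item = (H256, U256)>) {
    let count = /* estimate or pre-count */;
    if count > LARGE_STORAGE_THRESHOLD {
        // Split across 16 storage-slot-nibble ranges, parallel build
        trie_from_sorted_parallel(slots, &thread_pool, /* ... */);
    } else {
        // Current single-threaded path
        trie_from_sorted_accounts_with_stats(slots);
    }
}
```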
Key design choices
Threshold for "large." The distribution is highly bimodal (median <1ms, tail 5–245s), so the threshold isn't critical. Suggest ~1M leaves or ~3s historical time. Below that, parallelization overhead likely exceeds gain.
Pool sharing strategy. trie_from_sorted_parallel uses a thread pool for its 16 subtasks. The outer insert_storages dispatcher also uses a pool. Options:

Shared pool (like insert_accounts): one 16-thread pool, subtasks and outer tasks contend for slots. Risk: monster subtasks starve the long tail.
Nested pool: outer 16-thread pool + inner N-thread pool per large task. Cleaner isolation but more threads total.
Adaptive: monster subtasks get pool priority (execute_priority) like flushes do today.

Buffer pool sizing. Currently 32 buffers. With 16 concurrent subtasks on the monster AND 15 other tasks on small accounts, that's 31 potentially concurrent users. May need to bump the pool or confirm buffer contention doesn't regress. Measure io_wait (buffer wait) on the monster before/after.
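For reference, a minimal model of what the buffer pool does, assuming it behaves like a bounded set of reusable buffers handed out over a channel (BufferPool and its methods are illustrative, not ethrex's actual types). The measured io_wait/buffer_wait corresponds to the time spent blocked waiting for a free buffer:

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::time::{Duration, Instant};

/// Illustrative stand-in for the 32-buffer flush pool (a real shared pool would
/// need a thread-safe receiver, e.g. behind a Mutex or a crossbeam channel).
struct BufferPool {
    free: Receiver<Vec<u8>>,
    give_back: Sender<Vec<u8>>,
}

impl BufferPool {
    fn new(count: usize, capacity: usize) -> Self {
        let (give_back, free) = channel();
        for _ in 0..count {
            give_back.send(Vec::with_capacity(capacity)).unwrap();
        }
        BufferPool { free, give_back }
    }

    /// Blocks until a buffer is free; the returned Duration is the "buffer wait".
    fn acquire(&self) -> (Vec<u8>, Duration) {
        let start = Instant::now();
        let buf = self.free.recv().expect("all buffer owners dropped");
        (buf, start.elapsed())
    }

    fn release(&self, mut buf: Vec<u8>) {
        buf.clear();
        let _ = self.give_back.send(buf);
    }
}
```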
Validation plan

Baseline. Reproduce wall, parallelism, wait_time, and the monster's per-account stats on a mainnet sync with the current code.
Implement the nibble split for trie_from_sorted_parallel inside storage tasks, with LARGE_STORAGE_THRESHOLD chosen to catch the top ~30 accounts.
Compare per-account wall/cpu/io_wait for the monster and other large accounts before/after.
Verify no regression on the long tail. Measure wait_time (dispatcher blocked time) — if it increases, buffer/slot contention has regressed the small-account path.
Correctness. Run the full sync with the release-with-debug-assertions profile and confirm storage trie roots match expected values at pivot.

Dependencies and ordering
No hard dependency on the spawned migration (Snapsync rewrite (also with spawned) #4240), but after that migration this change becomes a self-contained modification inside StorageActor's per-account handler. Before the migration, the change threads through the current dispatcher loop in snap_sync.rs.

The StorageTrieTracker refactor (already merged) gives cleaner data ownership at the download layer and is unrelated to the trie build.

Instrumentation gaps worth closing first
The profiling that motivates this issue has limitations. Before investing in the optimization, consider:
No per-flush timing — flush_nodes_to_write is where real disk writes happen; currently no instrumentation. Adding flush_time to TrieInsertStats would let us confirm whether we're I/O-bound or CPU-bound on the monster (current hypothesis: CPU-bound, 99%, based on io_wait — but io_wait is actually buffer-wait, not disk I/O). A possible shape is sketched after this list.
io_wait is misnamed — it currently means "time blocked on buffer pool recv", not disk I/O. Rename to buffer_wait.
No OS-level CPU utilization samples — capture via pidstat -p <pid> -u 1 in parallel during runs for ground-truth CPU utilization. The current parallelism=8.1x slightly overstates OS CPU usage for short tasks.
No insert_accounts aggregate line (the phase PR #6410 parallelized internally) — useful for comparing phase utilization post-#6410.

These are loosely tracked in the #6470 backlog.
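A possible shape for the per-flush timing. The field set of TrieInsertStats here is assumed and flush_nodes_to_write is stubbed; only the wrap-and-accumulate pattern is the point:

```rust
use std::time::{Duration, Instant};

/// Assumed shape; the real TrieInsertStats on perf/snap-sync-profiling may differ.
#[derive(Default)]
struct TrieInsertStats {
    cpu: Duration,
    buffer_wait: Duration, // today's io_wait: time blocked on buffer pool recv
    flush_time: Duration,  // proposed: time spent inside flush_nodes_to_write
}

/// Stand-in for the real write path (RocksDB batch write).
fn flush_nodes_to_write(nodes: &[(Vec<u8>, Vec<u8>)]) {
    let _ = nodes;
}

fn flush_with_stats(nodes: &[(Vec<u8>, Vec<u8>)], stats: &mut TrieInsertStats) {
    let start = Instant::now();
    flush_nodes_to_write(nodes);
    stats.flush_time += start.elapsed();
}
```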
Experimental commits (starting point)
Three commits on branch snapsync-roadmap are the working starting point for the analogous account-trie change that shipped in #6410:

b82ba0e1b perf(l1): parallelize state trie building across 16 nibble ranges
ac0709f40 fix(l1): use streaming RocksDB iterators for parallel trie building
8dfa7f2e3 fix(l1): address review — flush code hashes incrementally, document thread count
These apply to insert_accounts (the single big trie); the analogous insert_storages change would apply trie_from_sorted_parallel inside each large storage task.

References
SNAP_SYNC_WORKSTREAMS.md §5 "Further parallelization (big-account trie building)"
Branch perf/snap-sync-profiling (based on feat(l1): add snap sync observability endpoints and tooling #6470)
perf(l1): optimize trie building in snap sync insertion #6410 (insert_accounts parallelization)
Snapsync rewrite (also with spawned) #4240 (StorageActor)
request_storage_ranges for correctness, readability, and performance #6140