Bundle of independent improvements to the BAL parallel-execution path
(execute_block_parallel + handle_merkleization_bal + warm_block_from_bal +
CachingDatabase), validated against a 149-block stress fixture (100M gas,
200-500 tx/block, ~25M-gas median blocks).
Headline (per-block medians):
Metric        Sequential   Parallel (no bundle)   + bundle   vs seq     vs par-base
Ggas/s        1.78         2.88                   3.64       +104.3%    +26.4%
total (ms)    23.86        14.43                  11.44      -52.1%     -20.7%
exec (ms)     21.97        12.94                  6.67       -69.6%     -48.5%
warmer (ms)   7.41         5.39                   3.93       -47.0%     -27.1%
store (ms)    1.60         1.19                   1.25       -21.9%     +5.0%
The bundle widens the speedup the parallel path already provided over
sequential: the Ggas/s gain goes from about +62% (2.88 vs 1.78) to +104.3%.
The changes (each is independently shippable; combined here for atomic
review since they touch overlapping code):
A. handle_merkleization_bal overlap fix (crates/blockchain/blockchain.rs)
`for updates in rx { ... }` blocked until the channel closed (= exec end).
execute_block_parallel sends exactly one batch up front from
bal_to_account_updates, so the loop then waited for nothing useful and
serialized Stage B (parallel storage roots) after exec instead of
overlapping with it. Replaced with a single rx.recv() and dropped the
FxHashMap merge step (BAL guarantees one entry per address).
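The difference between the two receive shapes can be sketched with std::sync::mpsc. This is a minimal illustration, not the code in blockchain.rs: the `Batch` type and both function names are hypothetical, and the real channel type may differ.

```rust
use std::sync::mpsc::Receiver;

// Hypothetical stand-in for one batch of account updates.
type Batch = Vec<u64>;

/// Old shape: iterating the receiver only finishes once every sender has
/// been dropped, so the caller cannot move on until execution ends.
fn drain_until_close(rx: Receiver<Batch>) -> Vec<Batch> {
    rx.into_iter().collect()
}

/// New shape: exactly one batch is expected, so a single recv() returns as
/// soon as that batch arrives, even while the sender is still alive.
fn take_single_batch(rx: &Receiver<Batch>) -> Option<Batch> {
    rx.recv().ok()
}
```

With the sender still held by the executing side, `take_single_batch` returns immediately after the first send, whereas `drain_until_close` would keep blocking until the sender is dropped.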
B. Adaptive threshold for BAL parallel exec (crates/vm/backends/levm/mod.rs)
Added BAL_PARALLEL_TX_THRESHOLD = 5. Blocks below the threshold fall
through to the sequential path, which produces a BAL during exec;
blockchain.rs hash-compares the produced BAL against the header BAL, giving
the same correctness without the parallel path's constant overhead.
Mirrors reth's SMALL_BLOCK_TX_THRESHOLD; trips on <1% of mainnet blocks
(100-block sample).
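A minimal sketch of the dispatch, assuming "below threshold" means strictly fewer transactions than the constant. `ExecPath`, `choose_exec_path`, and the `header_has_bal` flag are hypothetical names for illustration; only the constant and its value come from this change.

```rust
/// Value taken from the change description.
const BAL_PARALLEL_TX_THRESHOLD: usize = 5;

#[derive(Debug, PartialEq)]
enum ExecPath {
    Sequential,
    Parallel,
}

/// Hypothetical helper: small blocks (or blocks without a header BAL) take
/// the sequential path; everything else takes the BAL parallel path.
fn choose_exec_path(tx_count: usize, header_has_bal: bool) -> ExecPath {
    if header_has_bal && tx_count >= BAL_PARALLEL_TX_THRESHOLD {
        ExecPath::Parallel
    } else {
        ExecPath::Sequential
    }
}
```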
C. import-bench inter-block sleep 500ms -> 100ms (cmd/ethrex/cli.rs)
Bench tooling change. The sleep gates background trie-layer writeback
from bleeding into the next block's per-block timer; 100ms is well
above measured Phase 2 cost on SSD. Cuts bench wall clock 80% without
affecting the per-block metric. NO effect on production paths.
Q1. Skip prestate read in bal_to_account_updates when BAL covers all info
fields (crates/vm/backends/levm/mod.rs). Two fast paths added:
storage-only updates (info: None, removed: false by construction);
full info coverage with non-empty post (removal impossible, info from
BAL alone). Slow path keeps existing behavior for partial coverage.
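The three-way split can be sketched as a classification over which info fields the BAL records for an account. The `BalAccount` struct below is a simplified, hypothetical view (storage changes are omitted because they do not affect the decision), and "non-empty post" is assumed to mean a nonzero balance, a nonzero nonce, or non-empty code.

```rust
/// Hypothetical, simplified BAL entry: each info field is Some(post-value)
/// only if the BAL recorded a change to it.
struct BalAccount {
    balance: Option<u128>,
    nonce: Option<u64>,
    code: Option<Vec<u8>>,
}

#[derive(Debug, PartialEq)]
enum UpdatePath {
    /// No info field changed: info stays None, removal is impossible.
    StorageOnly,
    /// All info fields come from the BAL and the post-state is non-empty.
    FullInfo,
    /// Partial coverage: the prestate read is still required.
    NeedsPrestate,
}

fn classify(acc: &BalAccount) -> UpdatePath {
    match (&acc.balance, &acc.nonce, &acc.code) {
        (None, None, None) => UpdatePath::StorageOnly,
        (Some(b), Some(n), Some(c)) if *b != 0 || *n != 0 || !c.is_empty() => {
            UpdatePath::FullInfo
        }
        _ => UpdatePath::NeedsPrestate,
    }
}
```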
Q2. Cap per-tx GeneralizedDatabase capacity at 32
(crates/vm/backends/levm/mod.rs::execute_block_parallel). Previously
sized to bal.accounts().len() (often 100s on stress blocks), while the p50
tx touches <10 accounts. Reduces allocator pressure across rayon workers.
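As a sketch, the cap amounts to a `min` on the requested capacity. The constant name, both helpers, and the `u64` key/value placeholders are hypothetical; only the value 32 comes from this change.

```rust
use std::collections::HashMap;

/// Cap value taken from the change description.
const PER_TX_DB_CAPACITY_CAP: usize = 32;

/// Hypothetical helper: capacity to request for one tx's account map.
fn per_tx_capacity(bal_account_count: usize) -> usize {
    bal_account_count.min(PER_TX_DB_CAPACITY_CAP)
}

/// Builds an empty per-tx map; the map still grows on demand if a tx
/// touches more accounts than the cap.
fn new_per_tx_accounts(bal_account_count: usize) -> HashMap<u64, u64> {
    HashMap::with_capacity(per_tx_capacity(bal_account_count))
}
```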
Q3. Memoize code_from_bal results across seed_db_from_bal calls
(crates/vm/backends/levm/mod.rs). Pre-compute Code objects (hash +
jump_targets) once per BAL code change before the par_iter; pass cache
via optional param to seed_db_from_bal. Saves N-1 keccak+jump-target
scans per code change per block (N = tx count).
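The memoization can be sketched as follows. This is an illustration under stated assumptions, not the actual code_from_bal: `Code` here is a hypothetical stand-in, the hash uses std's DefaultHasher in place of keccak256, and the cache is keyed by the index of the code change, which may not match the real key.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Arc;

/// Hypothetical stand-in for the real Code object.
#[derive(Debug)]
struct Code {
    hash: u64, // stand-in: the real code hash is keccak256
    jump_targets: Vec<usize>,
}

/// Collects JUMPDEST (0x5b) offsets, skipping PUSH1..PUSH32 immediates.
fn jump_targets(bytecode: &[u8]) -> Vec<usize> {
    let mut targets = Vec::new();
    let mut pc = 0;
    while pc < bytecode.len() {
        let op = bytecode[pc];
        if op == 0x5b {
            targets.push(pc);
        } else if (0x60..=0x7f).contains(&op) {
            pc += (op - 0x5f) as usize; // skip the immediate bytes
        }
        pc += 1;
    }
    targets
}

fn build_code(bytecode: &[u8]) -> Code {
    let mut h = DefaultHasher::new();
    bytecode.hash(&mut h);
    Code { hash: h.finish(), jump_targets: jump_targets(bytecode) }
}

/// Built once per block, before the parallel loop.
fn precompute_codes(code_changes: &[Vec<u8>]) -> HashMap<usize, Arc<Code>> {
    code_changes
        .iter()
        .enumerate()
        .map(|(i, bytes)| (i, Arc::new(build_code(bytes))))
        .collect()
}

/// Per-tx lookup: shares the cached Code when a cache is passed, otherwise
/// rebuilds it (the pre-change behavior).
fn code_for(idx: usize, bytecode: &[u8], cache: Option<&HashMap<usize, Arc<Code>>>) -> Arc<Code> {
    match cache.and_then(|c| c.get(&idx)) {
        Some(code) => Arc::clone(code),
        None => Arc::new(build_code(bytecode)),
    }
}
```

Passing the cache as an `Option` keeps callers without a precomputed cache working unchanged, which mirrors the optional parameter described above.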
Q8. Move per-tx BAL validation into the rayon par_iter closure
(crates/vm/backends/levm/mod.rs::execute_block_parallel). Eliminates a
serial post-exec validation pass (~3 ms median across 200 txs). Drops
current_state and codes inside the closure after validation runs —
they no longer cross the rayon boundary, reducing per-tx allocator
pressure. Closure returns deferred Option<EvmError> so gas-limit check
still takes priority over BAL mismatch errors.
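The deferred-error ordering can be sketched like this. All types and names are hypothetical, and the gas-limit check is simplified to a single total comparison; the real check in execute_block_parallel may be structured differently.

```rust
/// Hypothetical error type for illustration.
#[derive(Debug, PartialEq)]
enum BlockError {
    GasLimitExceeded,
    BalMismatch { tx_index: usize },
}

/// What each parallel closure hands back for one tx.
struct TxOutcome {
    gas_used: u64,
    /// Validation result computed inside the closure but not raised yet.
    deferred: Option<BlockError>,
}

/// Runs after the parallel phase: the gas-limit check comes first, and only
/// then is the first deferred per-tx error (in tx order) surfaced.
fn finalize(outcomes: Vec<TxOutcome>, gas_limit: u64) -> Result<u64, BlockError> {
    let total: u64 = outcomes.iter().map(|o| o.gas_used).sum();
    if total > gas_limit {
        return Err(BlockError::GasLimitExceeded);
    }
    match outcomes.into_iter().find_map(|o| o.deferred) {
        Some(err) => Err(err),
        None => Ok(total),
    }
}
```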
DashMap. CachingDatabase RwLock<HashMap> -> DashMap<_, _, FxBuildHasher>
(crates/vm/levm/src/db/mod.rs). Found via perf record: 11% of CPU was
RwLock::read_contended on the single account RwLock with 16 rayon
workers hammering it. Sharded concurrent map (64 default shards)
eliminates contention. Sequential paths are unaffected: only 2 threads
access the cache there, so the lock was not contended.
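To show why sharding removes the single-lock bottleneck, here is a hand-rolled sharded map using only std. It is a simplified stand-in for the idea behind DashMap, not DashMap's API or implementation; `ShardedMap` and its methods are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;

/// Each key hashes to one shard, so threads touching different shards
/// never wait on the same lock.
struct ShardedMap<K, V> {
    shards: Vec<RwLock<HashMap<K, V>>>,
}

impl<K: Hash + Eq, V: Clone> ShardedMap<K, V> {
    fn new(shard_count: usize) -> Self {
        let shards = (0..shard_count.max(1))
            .map(|_| RwLock::new(HashMap::new()))
            .collect();
        Self { shards }
    }

    fn shard_for(&self, key: &K) -> &RwLock<HashMap<K, V>> {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        &self.shards[(h.finish() as usize) % self.shards.len()]
    }

    fn get(&self, key: &K) -> Option<V> {
        let guard = self.shard_for(key).read().unwrap();
        guard.get(key).cloned()
    }

    fn insert(&self, key: K, value: V) -> Option<V> {
        let shard = self.shard_for(&key);
        let mut guard = shard.write().unwrap();
        guard.insert(key, value)
    }
}
```

With a single `RwLock<HashMap>`, every reader contends on one lock word; here a reader only contends with threads that hit the same shard.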
Effect on non-BAL paths (block production, pre-Amsterdam, sequential
fallback): DashMap is neutral (low contention), threshold-fallback adds a
protective branch, other changes only fire on the BAL parallel-validation
path. No regressions in non-parallel paths.
Summary
Bundle of independent improvements to the BAL parallel-execution path, validated against a 149-block stress fixture (100M gas, 200–500 tx/block, ~25M-gas median blocks).
The bundle widens the speedup the parallel path already provided over sequential: the Ggas/s gain goes from about +62% (2.88 vs 1.78) to +104.3%.
What's in the bundle
Each change is independently shippable; combined here for atomic review since they touch overlapping code in `execute_block_parallel`.
Effect on non-BAL paths
Tried-and-rejected (documented for context)
Test plan