perf(l1): reduce BAL parallel-path overhead#6543

Draft
edg-l wants to merge 1 commit into bal-devnet-4 from perf/bal-parallel-overhead

Conversation


@edg-l edg-l commented Apr 28, 2026

Summary

Bundle of independent improvements to the BAL parallel-execution path, validated against a 149-block stress fixture (100M gas, 200–500 tx/block, ~25M-gas median blocks).

| Metric (median) | Sequential | Parallel (no bundle) | Parallel + bundle | vs sequential | vs parallel (no bundle) |
|---|---|---|---|---|---|
| Ggas/s | 1.78 | 2.88 | 3.64 | +104.3% | +26.4% |
| total (ms) | 23.86 | 14.43 | 11.44 | −52.1% | −20.7% |
| exec (ms) | 21.97 | 12.94 | 6.67 | −69.6% | −48.5% |
| warmer (ms) | 7.41 | 5.39 | 3.93 | −47.0% | −27.1% |
| store (ms) | 1.60 | 1.19 | 1.25 | −21.9% | +5.0% |

The bundle doubles the speedup margin the parallel path was already providing over sequential.

What's in the bundle

Each change is independently shippable; combined here for atomic review since they touch overlapping code in `execute_block_parallel`.

  • A. handle_merkleization_bal overlap fix (`blockchain.rs`) — replace channel drain-loop with single `recv()`. Stage B (parallel storage roots) now overlaps with exec instead of serializing after it.
  • B. Adaptive threshold `BAL_PARALLEL_TX_THRESHOLD = 5` — below threshold falls through to sequential exec (which produces a BAL during exec; `blockchain.rs` hash-compares against header). Mirrors reth's `SMALL_BLOCK_TX_THRESHOLD`.
  • C. import-bench inter-block sleep 500ms → 100ms (bench tooling change, no production effect) — cuts bench wall-clock by 80%.
  • Q1. Skip prestate read in `bal_to_account_updates` when BAL covers all info fields. Two fast paths: storage-only updates and full-info-coverage with non-empty post.
  • Q2. Per-tx `GeneralizedDatabase` capacity cap at 32 (previously sized to the full BAL account count, often hundreds; the p50 tx touches <10 accounts).
  • Q3. Memoize `code_from_bal` results — pre-compute Code objects (hash + jump_targets) once per BAL code change before the par_iter; pass cache via optional param to `seed_db_from_bal`.
  • Q8. Move per-tx BAL validation into the rayon par_iter closure — eliminates a serial post-exec validation pass; drops `current_state`/`codes` inside the closure (no longer cross rayon boundary).
  • DashMap swap in `CachingDatabase` — perf record showed 11% of CPU in `RwLock::read_contended` with 16 rayon workers hammering the single account RwLock. Replaced with sharded `DashMap<_, _, FxBuildHasher>`. Sequential paths unaffected (only 2 threads, weren't contended).
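
A minimal std-only sketch of item A's shape (illustrative, not the ethrex code): with exactly one batch in flight, a single `recv()` unblocks as soon as the batch lands, whereas a drain loop waits for the sender to be dropped at the end of exec.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // execute_block_parallel sends exactly one batch of account updates
    // up front; names and payloads here are illustrative.
    let (tx, rx) = mpsc::channel::<Vec<&'static str>>();

    let exec = thread::spawn(move || {
        tx.send(vec!["0xaaa...", "0xbbb..."]).unwrap();
        // In the real code `tx` stays alive while transactions execute;
        // a `for updates in rx` drain loop on the receiving side would
        // block until this thread returns and drops the sender.
    });

    // A single recv() unblocks as soon as the one batch arrives, letting
    // Stage B (parallel storage roots) overlap with exec.
    let updates = rx.recv().unwrap();
    assert_eq!(updates.len(), 2);
    exec.join().unwrap();
}
```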

Effect on non-BAL paths

  • Block production / pre-Amsterdam / sequential fallback: DashMap is neutral (low contention); threshold-fallback adds a protective branch; other changes only fire on the BAL parallel-validation path.
  • No regressions in non-parallel paths.

Tried-and-rejected (documented for context)

  • Drop `accessed_accounts` tracker: not actually redundant — superset/subset of shadow recorder, distinct correctness roles.
  • `rayon::join` warmer Phase 2 + Phase 3: nested rayon on shared pool starved exec workers (−12%), warmer didn't speed up (already I/O-bound saturating internal par_iter).
  • Validation-only BAL recorder: exec saved 5%, but those savings shifted to "after exec" merkle drain — net per-block flat. Once exec < merkle wall-clock, exec-side savings have diminishing returns on per-block time.

Test plan

  • `cargo check -p ethrex-blockchain -p ethrex-levm -p ethrex-vm` (clean)
  • Stress fixture (149 blocks, 100M gas, mainnet-shape): per-block medians match the table above
  • Hive Amsterdam consume-engine
  • EF blockchain tests (BAL fixtures `bal@v6.0.0`)

@github-actions github-actions Bot added L1 Ethereum client performance Block execution throughput and performance in general labels Apr 28, 2026
@github-actions

Lines of code report

Total lines added: 69
Total lines removed: 34
Total lines changed: 103

Detailed view
| File | Lines | Diff |
|---|---|---|
| ethrex/crates/blockchain/blockchain.rs | 2482 | −5 |
| ethrex/crates/vm/backends/levm/mod.rs | 2426 | +69 |
| ethrex/crates/vm/levm/src/db/mod.rs | 119 | −29 |

Bundle of independent improvements to the BAL parallel-execution path
(execute_block_parallel + handle_merkleization_bal + warm_block_from_bal +
CachingDatabase), validated against a 149-block stress fixture (100M gas,
200-500 tx/block, ~25M-gas median blocks).

Headline (per-block medians):

  Metric        Sequential  Parallel(no bundle)  + bundle  vs seq    vs par-base
  Ggas/s        1.78        2.88                 3.64      +104.3%   +26.4%
  total (ms)    23.86       14.43                11.44     -52.1%    -20.7%
  exec (ms)     21.97       12.94                6.67      -69.6%    -48.5%
  warmer (ms)   7.41        5.39                 3.93      -47.0%    -27.1%
  store (ms)    1.60        1.19                 1.25      -21.9%    +5.0%

The bundle doubles the speedup margin the parallel path was already
providing over sequential.

The changes (each is independently shippable; combined here for atomic
review since they touch overlapping code):

A. handle_merkleization_bal overlap fix (crates/blockchain/blockchain.rs)
   `for updates in rx { ... }` blocked until the channel closed (i.e. until
   exec ended). execute_block_parallel sends exactly one batch up front
   from bal_to_account_updates, so the drain loop gained nothing and
   serialized Stage B (parallel storage roots) after exec instead of
   overlapping it with exec. Replaced with a single rx.recv() and dropped
   the FxHashMap merge step (the BAL guarantees one entry per address).

B. Adaptive threshold for BAL parallel exec (crates/vm/backends/levm/mod.rs)
   Added BAL_PARALLEL_TX_THRESHOLD = 5. Below threshold falls through to
   the sequential path which produces a BAL during exec; blockchain.rs
   hash-compares produced vs header BAL — same correctness, no parallel
   constants. Mirrors reth's SMALL_BLOCK_TX_THRESHOLD; trips on <1% of
   mainnet blocks (100-block sample).
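
The threshold decision in B can be sketched as follows (the constant name matches the PR; the helper function is hypothetical):

```rust
// Illustrative mirror of change B, not the ethrex implementation.
const BAL_PARALLEL_TX_THRESHOLD: usize = 5;

/// Below the threshold, the block falls through to sequential exec, which
/// produces a BAL during execution that the caller hash-compares against
/// the header BAL — same correctness, no parallel setup cost.
fn use_parallel_path(tx_count: usize) -> bool {
    tx_count >= BAL_PARALLEL_TX_THRESHOLD
}

fn main() {
    assert!(!use_parallel_path(4)); // small block: sequential fallback
    assert!(use_parallel_path(5)); // at or above threshold: parallel exec
}
```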

C. import-bench inter-block sleep 500ms -> 100ms (cmd/ethrex/cli.rs)
   Bench tooling change. The sleep gates background trie-layer writeback
   from bleeding into the next block's per-block timer; 100ms is well
   above measured Phase 2 cost on SSD. Cuts bench wall clock 80% without
   affecting the per-block metric. NO effect on production paths.

Q1. Skip prestate read in bal_to_account_updates when BAL covers all info
    fields (crates/vm/backends/levm/mod.rs). Two fast paths added:
    storage-only updates (info: None, removed: false by construction);
    full info coverage with non-empty post (removal impossible, info from
    BAL alone). Slow path keeps existing behavior for partial coverage.
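
Q1's fast-path predicate reduces to a small decision; a hedged sketch with hypothetical field names (the real entry type lives in the BAL structures):

```rust
// Illustrative stand-in for a BAL account entry; field names are invented.
#[derive(Default)]
struct BalEntry {
    storage_only: bool,       // info: None, removed: false by construction
    has_full_info: bool,      // all info fields covered by the BAL
    post_state_nonempty: bool, // non-empty post state => removal impossible
}

/// True when bal_to_account_updates can build the update from the BAL
/// alone, skipping the prestate read.
fn skips_prestate_read(e: &BalEntry) -> bool {
    e.storage_only || (e.has_full_info && e.post_state_nonempty)
}

fn main() {
    // Fast path 1: storage-only update.
    assert!(skips_prestate_read(&BalEntry { storage_only: true, ..Default::default() }));
    // Fast path 2: full info coverage with non-empty post state.
    assert!(skips_prestate_read(&BalEntry {
        has_full_info: true,
        post_state_nonempty: true,
        ..Default::default()
    }));
    // Partial coverage keeps the slow path (prestate read).
    assert!(!skips_prestate_read(&BalEntry { has_full_info: true, ..Default::default() }));
}
```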

Q2. Per-tx GeneralizedDatabase capacity cap at 32
    (crates/vm/backends/levm/mod.rs::execute_block_parallel). Previously
    sized to bal.accounts().len() (often 100s on stress blocks); p50 tx
    touches <10 accounts. Reduced allocator pressure across rayon workers.
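
The Q2 cap is a one-line sizing rule; a std-only sketch (constant and helper names are illustrative):

```rust
use std::collections::HashMap;

// Size per-tx caches for the typical working set, not the whole BAL.
const PER_TX_CACHE_CAP: usize = 32;

fn per_tx_capacity(bal_account_count: usize) -> usize {
    bal_account_count.min(PER_TX_CACHE_CAP)
}

fn main() {
    // A stress block may list hundreds of accounts in the BAL...
    assert_eq!(per_tx_capacity(400), 32);
    // ...while a small block still gets an exact-fit allocation.
    assert_eq!(per_tx_capacity(8), 8);
    // HashMap::with_capacity guarantees room for at least that many
    // entries before reallocating.
    let cache: HashMap<u64, u64> = HashMap::with_capacity(per_tx_capacity(400));
    assert!(cache.capacity() >= 32);
}
```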

Q3. Memoize code_from_bal results across seed_db_from_bal calls
    (crates/vm/backends/levm/mod.rs). Pre-compute Code objects (hash +
    jump_targets) once per BAL code change before the par_iter; pass cache
    via optional param to seed_db_from_bal. Saves N-1 keccak+jump-target
    scans per code change per block (N = tx count).
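
The Q3 memoization amounts to hoisting the per-code analysis out of the per-tx loop; a hedged std-only sketch (the `Code` type and `analyze` body are stand-ins, and real jump-target analysis also skips PUSH data):

```rust
use std::collections::HashMap;

#[derive(Clone)]
struct Code {
    hash: u64,
    jump_targets: Vec<usize>,
}

// Stand-in for the keccak + jump-target scan done once per code change.
fn analyze(bytecode: &[u8]) -> Code {
    Code {
        hash: bytecode.iter().map(|&b| b as u64).sum(), // not real keccak
        jump_targets: bytecode
            .iter()
            .enumerate()
            .filter(|(_, &b)| b == 0x5b) // JUMPDEST (naive: ignores PUSH data)
            .map(|(i, _)| i)
            .collect(),
    }
}

fn main() {
    let a: &[u8] = &[0x60, 0x00, 0x5b];
    let b: &[u8] = &[0x5b, 0x5b];
    // Pre-compute once before the par_iter...
    let cache: HashMap<usize, Code> = [a, b]
        .iter()
        .enumerate()
        .map(|(i, bc)| (i, analyze(bc)))
        .collect();
    // ...then all N transactions reuse the cached entries instead of
    // re-running the scan (saving N-1 analyses per code change).
    assert_eq!(cache[&0].jump_targets, vec![2]);
    assert_eq!(cache[&1].jump_targets, vec![0, 1]);
}
```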

Q8. Move per-tx BAL validation into the rayon par_iter closure
    (crates/vm/backends/levm/mod.rs::execute_block_parallel). Eliminates a
    serial post-exec validation pass (~3 ms median across 200 txs). Drops
    current_state and codes inside the closure after validation runs —
    they no longer cross the rayon boundary, reducing per-tx allocator
    pressure. Closure returns deferred Option<EvmError> so gas-limit check
    still takes priority over BAL mismatch errors.
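
The deferred-error shape in Q8 can be sketched like this (type and function names are hypothetical; only the priority ordering mirrors the PR):

```rust
#[derive(Clone, Debug, PartialEq)]
enum BlockError {
    GasLimitExceeded,
    BalMismatch,
}

struct TxOutcome {
    gas_used: u64,
    // BAL validation runs inside the par_iter closure; its error is
    // deferred rather than returned immediately.
    deferred_bal_error: Option<BlockError>,
}

fn finalize(outcomes: &[TxOutcome], gas_limit: u64) -> Result<(), BlockError> {
    let total: u64 = outcomes.iter().map(|o| o.gas_used).sum();
    if total > gas_limit {
        // Gas-limit check takes priority over any deferred BAL mismatch.
        return Err(BlockError::GasLimitExceeded);
    }
    outcomes
        .iter()
        .find_map(|o| o.deferred_bal_error.clone())
        .map_or(Ok(()), Err)
}

fn main() {
    let outcomes = vec![
        TxOutcome { gas_used: 60, deferred_bal_error: None },
        TxOutcome { gas_used: 60, deferred_bal_error: Some(BlockError::BalMismatch) },
    ];
    // Both errors present: gas-limit wins.
    assert_eq!(finalize(&outcomes, 100), Err(BlockError::GasLimitExceeded));
    // Under the limit, the deferred BAL mismatch surfaces.
    assert_eq!(finalize(&outcomes, 200), Err(BlockError::BalMismatch));
}
```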

DashMap. CachingDatabase RwLock<HashMap> -> DashMap<_, _, FxBuildHasher>
    (crates/vm/levm/src/db/mod.rs). Found via perf record: 11% of CPU was
    RwLock::read_contended on the single account RwLock with 16 rayon
    workers hammering it. Sharded concurrent map (64 default shards)
    eliminates contention. Sequential paths unaffected (only 2 threads
    access the cache, weren't contended).
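
Why sharding helps can be shown with a std-only toy (this is what DashMap does internally with lock-free refinements; the type here is illustrative, not the `CachingDatabase` code):

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Toy sharded map: readers and writers hash to different shards, so
// 16 workers no longer serialize on one RwLock.
struct ShardedMap {
    shards: Vec<RwLock<HashMap<u64, u64>>>,
}

impl ShardedMap {
    fn new(shard_count: usize) -> Self {
        Self {
            shards: (0..shard_count).map(|_| RwLock::new(HashMap::new())).collect(),
        }
    }

    fn shard(&self, key: u64) -> &RwLock<HashMap<u64, u64>> {
        // Real implementations shard on the hash; modulo on the key
        // keeps the sketch simple.
        &self.shards[(key as usize) % self.shards.len()]
    }

    fn insert(&self, key: u64, val: u64) {
        self.shard(key).write().unwrap().insert(key, val);
    }

    fn get(&self, key: u64) -> Option<u64> {
        self.shard(key).read().unwrap().get(&key).copied()
    }
}

fn main() {
    let m = ShardedMap::new(64); // DashMap's default is also a power-of-two shard count
    m.insert(1, 10);
    m.insert(2, 20);
    assert_eq!(m.get(1), Some(10));
    assert_eq!(m.get(3), None);
}
```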

Effect on non-BAL paths (block production, pre-Amsterdam, sequential
fallback): DashMap is neutral (low contention), threshold-fallback adds a
protective branch, other changes only fire on the BAL parallel-validation
path. No regressions in non-parallel paths.
@edg-l edg-l force-pushed the perf/bal-parallel-overhead branch from 203f859 to 1e3ac87 on April 28, 2026, 13:35
@github-actions

github-actions Bot commented Apr 28, 2026

Benchmark Results Comparison


Benchmark Results: BubbleSort

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| main_revm_BubbleSort | 3.017 ± 0.020 | 2.985 | 3.049 | 1.12 ± 0.02 |
| main_levm_BubbleSort | 2.696 ± 0.033 | 2.672 | 2.784 | 1.00 ± 0.02 |
| pr_revm_BubbleSort | 3.000 ± 0.019 | 2.971 | 3.023 | 1.11 ± 0.02 |
| pr_levm_BubbleSort | 2.694 ± 0.042 | 2.664 | 2.802 | 1.00 |

Benchmark Results: ERC20Approval

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| main_revm_ERC20Approval | 988.2 ± 5.9 | 981.7 | 1000.8 | 1.01 ± 0.01 |
| main_levm_ERC20Approval | 1022.2 ± 8.3 | 1010.8 | 1037.0 | 1.04 ± 0.01 |
| pr_revm_ERC20Approval | 982.7 ± 8.4 | 975.0 | 1000.5 | 1.00 |
| pr_levm_ERC20Approval | 1027.1 ± 16.2 | 1008.2 | 1063.7 | 1.05 ± 0.02 |

Benchmark Results: ERC20Mint

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| main_revm_ERC20Mint | 135.0 ± 0.8 | 133.8 | 136.2 | 1.01 ± 0.01 |
| main_levm_ERC20Mint | 148.3 ± 0.5 | 147.8 | 149.0 | 1.11 ± 0.01 |
| pr_revm_ERC20Mint | 133.8 ± 0.5 | 132.6 | 134.4 | 1.00 |
| pr_levm_ERC20Mint | 147.9 ± 0.6 | 147.1 | 148.8 | 1.11 ± 0.01 |

Benchmark Results: ERC20Transfer

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| main_revm_ERC20Transfer | 234.2 ± 1.9 | 232.4 | 237.9 | 1.01 ± 0.01 |
| main_levm_ERC20Transfer | 252.4 ± 1.0 | 250.8 | 253.9 | 1.09 ± 0.01 |
| pr_revm_ERC20Transfer | 231.5 ± 1.1 | 229.9 | 233.0 | 1.00 |
| pr_levm_ERC20Transfer | 251.2 ± 2.1 | 248.7 | 255.9 | 1.09 ± 0.01 |

Benchmark Results: Factorial

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| main_revm_Factorial | 223.5 ± 0.9 | 221.6 | 224.4 | 1.00 |
| main_levm_Factorial | 250.3 ± 2.3 | 247.6 | 254.6 | 1.12 ± 0.01 |
| pr_revm_Factorial | 224.7 ± 1.5 | 223.7 | 228.9 | 1.01 ± 0.01 |
| pr_levm_Factorial | 247.5 ± 2.4 | 244.8 | 253.4 | 1.11 ± 0.01 |

Benchmark Results: FactorialRecursive

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| main_revm_FactorialRecursive | 1.621 ± 0.029 | 1.571 | 1.662 | 1.01 ± 0.02 |
| main_levm_FactorialRecursive | 9.135 ± 0.057 | 9.040 | 9.235 | 5.72 ± 0.09 |
| pr_revm_FactorialRecursive | 1.597 ± 0.024 | 1.561 | 1.642 | 1.00 |
| pr_levm_FactorialRecursive | 9.211 ± 0.026 | 9.167 | 9.256 | 5.77 ± 0.09 |

Benchmark Results: Fibonacci

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| main_revm_Fibonacci | 205.0 ± 1.3 | 203.7 | 207.4 | 1.00 ± 0.01 |
| main_levm_Fibonacci | 233.5 ± 6.2 | 225.0 | 245.9 | 1.14 ± 0.03 |
| pr_revm_Fibonacci | 204.8 ± 1.1 | 203.5 | 207.0 | 1.00 |
| pr_levm_Fibonacci | 225.3 ± 4.3 | 220.1 | 232.8 | 1.10 ± 0.02 |

Benchmark Results: FibonacciRecursive

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| main_revm_FibonacciRecursive | 837.6 ± 7.4 | 824.5 | 848.2 | 1.34 ± 0.02 |
| main_levm_FibonacciRecursive | 624.6 ± 9.5 | 609.7 | 642.9 | 1.00 ± 0.02 |
| pr_revm_FibonacciRecursive | 845.3 ± 17.3 | 825.1 | 878.7 | 1.35 ± 0.03 |
| pr_levm_FibonacciRecursive | 624.3 ± 7.8 | 615.5 | 642.0 | 1.00 |

Benchmark Results: ManyHashes

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| main_revm_ManyHashes | 8.4 ± 0.1 | 8.3 | 8.6 | 1.02 ± 0.01 |
| main_levm_ManyHashes | 9.8 ± 0.1 | 9.7 | 9.9 | 1.19 ± 0.01 |
| pr_revm_ManyHashes | 8.3 ± 0.0 | 8.2 | 8.3 | 1.00 |
| pr_levm_ManyHashes | 9.7 ± 0.1 | 9.7 | 9.8 | 1.17 ± 0.01 |

Benchmark Results: MstoreBench

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| main_revm_MstoreBench | 262.0 ± 4.6 | 258.5 | 274.8 | 1.14 ± 0.02 |
| main_levm_MstoreBench | 266.8 ± 101.4 | 231.0 | 555.1 | 1.16 ± 0.44 |
| pr_revm_MstoreBench | 263.6 ± 4.4 | 260.1 | 272.8 | 1.14 ± 0.02 |
| pr_levm_MstoreBench | 230.4 ± 1.2 | 228.5 | 232.7 | 1.00 |

Benchmark Results: Push

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| main_revm_Push | 288.4 ± 1.4 | 286.9 | 290.9 | 1.02 ± 0.01 |
| main_levm_Push | 284.1 ± 1.5 | 282.0 | 286.2 | 1.00 |
| pr_revm_Push | 289.0 ± 0.8 | 288.1 | 290.4 | 1.02 ± 0.01 |
| pr_levm_Push | 285.9 ± 4.5 | 282.6 | 296.5 | 1.01 ± 0.02 |

Benchmark Results: SstoreBench_no_opt

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| main_revm_SstoreBench_no_opt | 171.4 ± 2.5 | 167.3 | 175.4 | 1.72 ± 0.03 |
| main_levm_SstoreBench_no_opt | 99.7 ± 0.4 | 99.2 | 100.5 | 1.00 |
| pr_revm_SstoreBench_no_opt | 172.9 ± 3.1 | 170.6 | 181.4 | 1.73 ± 0.03 |
| pr_levm_SstoreBench_no_opt | 99.8 ± 0.3 | 99.1 | 100.2 | 1.00 ± 0.01 |

@github-actions

Benchmark Block Execution Results Comparison Against Main

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| base | 65.977 ± 0.197 | 65.739 | 66.401 | 1.00 ± 0.00 |
| head | 65.784 ± 0.105 | 65.599 | 65.913 | 1.00 |

