perf(history_map): box elements, pointer-only pending list#668

Draft: 0xVolosnikov wants to merge 2 commits into draft-0.4.0 from vv/history-map-box

@0xVolosnikov
Contributor

What ❔

Wraps each ElementWithHistory in Box so its address is stable across BTreeMap node splits, and stores NonNull<ElementWithHistory> in the pending-updates list instead of cloned keys. K is embedded inside ElementWithHistory so iterators/callbacks that need a key still get one without a BTreeMap lookup.
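The mechanism described above can be sketched roughly as follows. This is a simplified illustration, not the PR's code: `SketchMap`, `Element`, the `Vec`-based history and pending list, and the `u64` snapshot id are all hypothetical stand-ins (the real code uses `ElementWithHistory`, a `StackLinkedList`, `CacheSnapshotId`, and a custom allocator), and the sketch has not been validated under Miri.

```rust
use std::collections::btree_map::Entry;
use std::collections::BTreeMap;
use std::ptr::NonNull;

/// Hypothetical stand-in for `ElementWithHistory`: the key is embedded next to
/// a stack of (snapshot id, value) records.
struct Element<K, V> {
    key: K,
    history: Vec<(u64, V)>,
}

/// Hypothetical stand-in for `HistoryMap`.
struct SketchMap<K, V> {
    // Boxing keeps each element at a fixed heap address even when the
    // BTreeMap moves the `Box` itself between nodes.
    btree: BTreeMap<K, Box<Element<K, V>>>,
    // Pending updates hold a pointer to the element instead of a cloned key.
    pending: Vec<(NonNull<Element<K, V>>, u64)>,
    current_snapshot: u64,
}

impl<K: Ord + Clone, V> SketchMap<K, V> {
    fn new() -> Self {
        Self { btree: BTreeMap::new(), pending: Vec::new(), current_snapshot: 0 }
    }

    /// Starts a new snapshot; later updates are tagged with the returned id.
    fn snapshot(&mut self) -> u64 {
        self.current_snapshot += 1;
        self.current_snapshot
    }

    fn update(&mut self, key: K, value: V) {
        let snapshot = self.current_snapshot;
        let elem = match self.btree.entry(key) {
            Entry::Occupied(e) => e.into_mut(),
            Entry::Vacant(e) => {
                // The key is cloned once per unique key, not once per update.
                let embedded = e.key().clone();
                e.insert(Box::new(Element { key: embedded, history: Vec::new() }))
            }
        };
        elem.history.push((snapshot, value));
        self.pending.push((NonNull::from(&mut **elem), snapshot));
    }

    fn get(&self, key: &K) -> Option<&V> {
        self.btree.get(key).and_then(|e| e.history.last()).map(|(_, v)| v)
    }

    /// Undoes every update tagged with `id` or later, without any map lookup.
    /// Rolled-back elements stay in the map with a shorter history.
    fn rollback(&mut self, id: u64) {
        while let Some((ptr, snap)) = self.pending.last().copied() {
            if snap < id {
                break;
            }
            self.pending.pop();
            // SAFETY (sketch): elements are only dropped in `clear`, which
            // empties `pending` first, so `ptr` still points at a live element.
            let elem = unsafe { &mut *ptr.as_ptr() };
            elem.history.pop();
        }
    }

    /// Keys of pending entries, read from the embedded copy.
    fn pending_keys(&self) -> Vec<K> {
        self.pending
            .iter()
            .map(|(ptr, _)| {
                // SAFETY (sketch): same invariant as in `rollback`.
                let elem = unsafe { ptr.as_ref() };
                elem.key.clone()
            })
            .collect()
    }

    /// Drops all pointers before dropping the elements they point at.
    fn clear(&mut self) {
        self.pending.clear();
        self.btree.clear();
    }
}
```

Inserting many keys after the first update forces node splits in the `BTreeMap`, yet the pointer recorded for the first key stays valid because only the `Box` handle moves, not the heap allocation behind it.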

Affected:

  • zk_ee/src/common_structs/history_map/mod.rs
  • zk_ee/src/common_structs/history_map/element_with_history.rs

Why ❔

HistoryMap is on the hot bookkeeping path of storage cache, account cache, transient storage, and preimage publication. Every non-coalesced update() cloned K (up to 52 B for WarmStorageKey) into a StackLinkedList<(K, CacheSnapshotId), A>; every rollback / commit / iter_altered_since_commit / apply_to_last_record_of_pending_changes then re-resolved each entry via BTreeMap::get(&K).

With this change:

  • update() no longer clones K; pending entry shrinks from (K, snap) to (NonNull, snap) = 16 B.
  • Rollback / commit / pending-iter paths bypass the BTreeMap descent entirely.
  • BTreeMap lookups on warm reads/writes (get, get_mut, get_or_insert) are unchanged — this targets the bookkeeping paths, not steady-state EVM dispatch.

Cost: one Box::new_in per unique key inserted in the map's lifetime, plus an extra K copy embedded inside ElementWithHistory (in addition to the BTreeMap key copy).
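As a rough illustration of the pending-entry shrink, the sketch below compares the two tuple sizes. It assumes a 64-bit snapshot id and models the 52-byte key as a plain byte array; `SnapshotIdSketch`, `KeySketch`, and `ElementSketch` are hypothetical names, and the exact numbers depend on the target's pointer width and alignment rules.

```rust
use std::mem::size_of;
use std::ptr::NonNull;

// Assumption: the snapshot id is 64 bits wide.
type SnapshotIdSketch = u64;
// Assumption: a 52-byte key modelled as raw bytes (stand-in for WarmStorageKey).
type KeySketch = [u8; 52];

// Placeholder element type; only a pointer to it is stored, so its size
// does not affect the pending entry.
struct ElementSketch {
    _payload: [u8; 128],
}

/// Returns (size of a key-based entry, size of a pointer-based entry).
fn pending_entry_sizes() -> (usize, usize) {
    (
        size_of::<(KeySketch, SnapshotIdSketch)>(),
        size_of::<(NonNull<ElementSketch>, SnapshotIdSketch)>(),
    )
}
```

On a typical 64-bit host the pointer-based entry comes out at 16 bytes, in line with the figure quoted above; the key-based entry is at least 60 bytes before padding.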

Benchmark results

Branch vs draft-0.4.0 (commit d62b8ad2):

| Block | Base eff | Head eff | Δ |
|---|---|---|---|
| 19299001 | 208,601,280 | 204,530,358 | −1.95 % |
| 22244135 | 134,741,585 | 132,440,306 | −1.71 % |

Raw cycles drop by 2.57 % and 2.17 % for blocks 19299001 and 22244135, respectively; Blake/Bigint/Keccak delegation counts are unchanged. compare_opcode_stats.py reports no per-opcode regressions, consistent with the change living outside the EVM hot loop.

Is this a breaking change?

  • Yes
  • No

Public API of HistoryMap, HistoryMapItemRef, HistoryMapItemRefMut is preserved. Internal ElementWithHistory gains a K type parameter, but it isn't constructed outside this module.

Notes / follow-ups

  • Soundness rests on the fact that elements never leave the Box until HistoryMap::clear(), which also resets the pending list — so pointers stored in pending are valid for their lifetime there. The unsafe blocks are annotated.
  • Pre-existing MIRI Tree-Borrows finding in element_pool.rs (NonNull::from_ref followed by later as_mut) is present on draft-0.4.0 too; not addressed here.
  • Option B (arena-allocating ElementWithHistory from a ListVec pool, like the existing ElementPool) would amortize allocations and is the natural next step if the per-unique-key Box allocation shows up in profiles; left unimplemented for now.

Checklist

  • PR title corresponds to the body of PR.
  • Tests for the changes have been added / updated. (Existing miri_* and rig-style HistoryMap tests cover the new code paths; internal ElementWithHistory tests updated for the new K type parameter.)
  • Documentation comments have been added / updated.
  • Code has been formatted.

0xVolosnikov added a commit that referenced this pull request May 20, 2026
## What ❔

Replaces `Bytes32`'s `Ord` impl on **the RISC-V proving target only**
with a word-by-word equality scan over the underlying `inner: [usize;
BYTES32_USIZE_SIZE]` (N=4 on 64-bit, 8 on RISC-V32):

- Equal-prefix iterations stay in the fast path: load word, compare
word, branch.
- On the first differing word, resolve byte-lex order by walking the
bytes of just that word — cheaper than `swap_bytes()` on RV32 without
the Zbb extension (which this target does not enable).
- `cmp` / `partial_cmp` are `#[inline]` since they sit on the BTreeMap
descent hot path.

On non-RV32 targets (forward / sequencer host) the impl falls back to
the original `as_u8_array_ref().cmp(...)`. libc's `memcmp` is already
SIMD-vectorized (SSE2/AVX/NEON), so the byte path beats a pure-Rust word
loop on the host. The chunked helper (`cmp_word_chunked`) is still
compiled on every target so the equivalence tests can validate it on the
host.
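A minimal sketch of the word-chunked comparison described above, assuming the value is stored as native-endian `usize` words over the same 32 bytes. `Bytes32Sketch` and `cmp_word_chunked_sketch` are hypothetical names for illustration; the PR's actual `cmp_word_chunked` may differ in detail.

```rust
use std::cmp::Ordering;
use std::mem::size_of;

const WORD_BYTES: usize = size_of::<usize>();
const WORDS: usize = 32 / WORD_BYTES;

/// Hypothetical stand-in for `Bytes32`: 32 bytes stored as machine words.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Bytes32Sketch {
    inner: [usize; WORDS],
}

impl Bytes32Sketch {
    /// Packs the bytes into words without changing their order in memory.
    fn from_bytes(bytes: [u8; 32]) -> Self {
        let mut inner = [0usize; WORDS];
        for (word, chunk) in inner.iter_mut().zip(bytes.chunks_exact(WORD_BYTES)) {
            let mut buf = [0u8; WORD_BYTES];
            buf.copy_from_slice(chunk);
            *word = usize::from_ne_bytes(buf);
        }
        Self { inner }
    }
}

/// Byte-lexicographic ordering computed one word at a time.
#[inline]
fn cmp_word_chunked_sketch(a: &Bytes32Sketch, b: &Bytes32Sketch) -> Ordering {
    for (wa, wb) in a.inner.iter().zip(b.inner.iter()) {
        // Fast path: equal words need only a load, a compare, and a branch.
        if wa != wb {
            // Slow path, taken at most once: order the bytes of the first
            // differing word as they sit in memory.
            return wa.to_ne_bytes().cmp(&wb.to_ne_bytes());
        }
    }
    Ordering::Equal
}
```

Because the words are read back with `to_ne_bytes`, the fallback sees the bytes in memory order on any endianness, so the result matches a plain `[u8; 32]` comparison.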

Affected:
- `zk_ee/src/utils/bytes32.rs`

## Why ❔

From the flamegraph analysis on `draft-0.4.0`:

- `impls::compare_bytes` is the **second-highest self-cost function in
the binary at 6.47 %**. On no-std RISC-V the `<[u8]>::cmp` path falls
back to `compiler-builtins`' generic `memcmp`
(`compiler-builtins/src/mem/impls.rs:388`) — a plain byte loop with no
chunking.
- Almost all of that self-cost is reached through `Bytes32::cmp`
(directly, or via `WarmStorageKey::cmp` → `Bytes32::cmp`) sitting inside
`NodeRef::find_key_index` in HistoryMap / preimage /
FlatStorageCommitment lookups.

Comparing word-aligned 32-byte values byte-by-byte is wasteful when the
struct is already a `[usize; N]` with `align(8)`. Forward mode doesn't
need the rewrite because libc memcmp already handles it efficiently.

## Benchmark results

Branch vs `origin/draft-0.4.0` (commit `59bdedfa`):

| Block | Base eff | Head eff | Δ eff | Δ raw |
|---|---|---|---|---|
| `19299001` | 208,601,280 | 204,648,699 | **−1.89 %** | **−2.49 %** |
| `22244135` | 134,741,585 | 132,504,954 | **−1.66 %** | **−2.10 %** |

Blake / Bigint / Keccak delegation counts unchanged in both runs — pure
raw-cycle reduction.

### Variant comparison (RV32 only — host path is unchanged)

| Variant | block 19299001 Δ eff | block 22244135 Δ eff |
|---|---|---|
| word-cmp + `to_be()` bswap on mismatch | −1.56 % | −1.39 % |
| **word-cmp + `#[inline]` + byte-fallback (this PR)** | **−1.89 %** | **−1.66 %** |
| word-cmp + 2-word XOR-OR fusion | −1.68 % | −1.45 % |

The 2-word XOR-OR variant regresses vs the simple loop — the extra
arithmetic per executed iteration doesn't pay for the halved branch
count on RV32. Picked the byte-fallback variant.

## Is this a breaking change?
- [ ] Yes
- [x] No

Public API of `Bytes32` is unchanged. `Ord` / `PartialOrd` impls produce
identical orderings to the previous implementation on every target
(verified by `cmp_tests::cmp_matches_byte_lex_on_pseudorandom_pairs`,
which invokes `cmp_word_chunked` directly so the helper is exercised on
the host too).

## Stackability with #668 / #669

This PR targets the leaf `compare_bytes` cost; #668 / #669 target the
BTreeMap lookups themselves on the pending-list paths. They hit
different code regions and are expected to compose (overlap only where
BTreeMap node descent invokes `Bytes32::cmp` — and even there the
per-compare cost goes down).

## Checklist

- [x] PR title corresponds to the body of PR.
- [x] Tests for the changes have been added / updated. (Equivalence
tests added: `cmp_matches_byte_lex_on_handcrafted_pairs`,
`cmp_matches_byte_lex_on_pseudorandom_pairs`. Both invoke
`cmp_word_chunked` directly so the chunked path is exercised on the
host.)
- [x] Documentation comments have been added / updated.
- [x] Code has been formatted.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
0xVolosnikov and others added 2 commits May 20, 2026 08:03
Wrap BTreeMap values in Box for stable addresses, embed K inside
ElementWithHistory, and store NonNull<ElementWithHistory> in the
pending-updates list instead of cloned keys. Bypasses BTreeMap lookups
on rollback/commit/iter-pending paths and removes K::clone() per
update().

Experimental — for benchmarking only.
`cargo clippy --workspace -- -D warnings` flagged the
`btree.iter().map(|(_k, v)| ...)` pattern in HistoryMap::iter as
`clippy::iter_kv_map`. Replace with `btree.values().map(|v| ...)`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
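The `clippy::iter_kv_map` fix mentioned in the second commit amounts to the following change in shape. The functions and map type below are illustrative only and are not taken from `HistoryMap::iter`.

```rust
use std::collections::BTreeMap;

/// Pattern that clippy flags: the key is destructured and then ignored.
fn lengths_via_iter(map: &BTreeMap<u32, String>) -> Vec<usize> {
    map.iter().map(|(_k, v)| v.len()).collect()
}

/// Equivalent form clippy suggests: iterate over the values directly.
fn lengths_via_values(map: &BTreeMap<u32, String>) -> Vec<usize> {
    map.values().map(|v| v.len()).collect()
}
```

Both forms visit the values in ascending key order, so the change is purely stylistic.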

Copilot AI left a comment

Pull request overview

This PR optimizes HistoryMap bookkeeping by ensuring ElementWithHistory has a stable address (boxing values stored in the BTreeMap) and by changing the pending-updates list to store raw pointers (NonNull) instead of cloned keys, avoiding repeated key clones and BTreeMap::get() lookups on rollback/commit/pending-iteration paths.

Changes:

  • Store Box<ElementWithHistory<...>> as the BTreeMap value to keep element addresses stable across node splits.
  • Change pending_updated_elements entries from (K, CacheSnapshotId) to (NonNull<ElementWithHistory>, CacheSnapshotId) and update rollback/commit/iter paths to dereference pointers.
  • Embed a copy of K inside ElementWithHistory so iterators/callbacks can still surface keys without map lookups.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| zk_ee/src/common_structs/history_map/mod.rs | Boxes map values and switches the pending list to pointer-based entries; updates rollback/commit/iteration logic accordingly. |
| zk_ee/src/common_structs/history_map/element_with_history.rs | Adds embedded key: K to ElementWithHistory and updates construction/tests for the new type parameter. |

 use alloc::collections::btree_map::Entry;
 use alloc::collections::BTreeMap;
-use core::{alloc::Allocator, fmt::Debug, ops::Bound};
+use core::{alloc::Allocator, fmt::Debug, ops::Bound, ptr::NonNull};
Comment on lines 14 to 22
 /// The history linked list. Always has at least one item with the snapshot id of 0.
-pub struct ElementWithHistory<V, A: Allocator + Clone, EP = ()> {
+///
+/// `key` is embedded so that the pending-updates list can store a stable pointer
+/// to an `ElementWithHistory` and still surface the key on iteration, avoiding
+/// a `BTreeMap::get(&K)` lookup per pending entry.
+pub struct ElementWithHistory<K, V, A: Allocator + Clone, EP = ()> {
+    /// Key owned by this element (separate from the BTreeMap key copy).
+    pub key: K,
     /// Additional properties associated with the element globally.
@github-actions
Contributor

Block-level effective cycles

| Benchmark | Symbol | Base Eff | Head Eff (%) | Base Raw | Head Raw (%) | Base Blake | Head Blake (%) | Base Bigint | Head Bigint (%) | Base Keccak | Head Keccak (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| block_19299001 (keccak DA) | process_block | 204,573,864 | 202,376,775 (-1.07%) | 154,504,016 | 152,306,927 (-1.42%) | 410,630 | 410,630 (+0.00%) | 7,681,862 | 7,681,862 (+0.00%) | 3,193,080 | 3,193,080 (+0.00%) |
| block_19299001 (blobs DA) | process_block | 253,334,505 | 251,139,860 (-0.87%) | 191,623,089 | 189,428,444 (-1.15%) | 414,340 | 414,340 (+0.00%) | 10,690,989 | 10,690,989 (+0.00%) | 3,079,505 | 3,079,505 (+0.00%) |
| block_22244135 (keccak DA) | process_block | 132,455,129 | 131,178,557 (-0.96%) | 103,982,317 | 102,705,745 (-1.23%) | 172,040 | 172,040 (+0.00%) | 5,054,163 | 5,054,163 (+0.00%) | 1,375,880 | 1,375,880 (+0.00%) |
| block_22244135 (blobs DA) | process_block | 181,850,772 | 180,572,647 (-0.70%) | 141,470,644 | 140,192,519 (-0.90%) | 174,090 | 174,090 (+0.00%) | 8,085,096 | 8,085,096 (+0.00%) | 1,313,576 | 1,313,576 (+0.00%) |

Block-level sub-phases

| Benchmark | Symbol | Base Eff | Head Eff (%) | Base Raw | Head Raw (%) | Base Blake | Head Blake (%) | Base Bigint | Head Bigint (%) | Base Keccak | Head Keccak (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| block_19299001 (blobs DA) | blob_versioned_hash | 49,568,581 | 49,568,581 (+0.00%) | 37,472,713 | 37,472,713 (+0.00%) | 3,710 | 3,710 (+0.00%) | 3,009,127 | 3,009,127 (+0.00%) | 0 | 0 (+0.00%) |
| block_22244135 (blobs DA) | blob_versioned_hash | 49,849,981 | 49,849,981 (+0.00%) | 37,693,449 | 37,693,449 (+0.00%) | 2,050 | 2,050 (+0.00%) | 3,030,933 | 3,030,933 (+0.00%) | 0 | 0 (+0.00%) |
| block_19299001 (blobs DA) | da_commitment | 1,899,315 | 1,895,514 (-0.20%) | 1,810,355 | 1,806,554 (-0.21%) | 5,560 | 5,560 (+0.00%) | 0 | 0 (+0.00%) | 0 | 0 (+0.00%) |
| block_19299001 (keccak DA) | da_commitment | 2,702,950 | 2,699,149 (-0.14%) | 2,162,286 | 2,158,485 (-0.18%) | 5,560 | 5,560 (+0.00%) | 0 | 0 (+0.00%) | 112,926 | 112,926 (+0.00%) |
| block_22244135 (blobs DA) | da_commitment | 1,160,727 | 1,158,453 (-0.20%) | 1,109,207 | 1,106,933 (-0.21%) | 3,220 | 3,220 (+0.00%) | 0 | 0 (+0.00%) | 0 | 0 (+0.00%) |
| block_22244135 (keccak DA) | da_commitment | 1,612,139 | 1,609,865 (-0.14%) | 1,313,999 | 1,311,725 (-0.17%) | 3,220 | 3,220 (+0.00%) | 0 | 0 (+0.00%) | 61,655 | 61,655 (+0.00%) |
| block_19299001 (keccak DA) | run_tx_loop | 186,698,725 | 184,377,370 (-1.24%) | 138,612,373 | 136,291,018 (-1.67%) | 316,840 | 316,840 (+0.00%) | 7,681,862 | 7,681,862 (+0.00%) | 3,072,366 | 3,072,366 (+0.00%) |
| block_22244135 (keccak DA) | run_tx_loop | 121,677,622 | 120,325,025 (-1.11%) | 94,390,422 | 93,037,825 (-1.43%) | 115,300 | 115,300 (+0.00%) | 5,054,163 | 5,054,163 (+0.00%) | 1,306,437 | 1,306,437 (+0.00%) |
| block_19299001 (blobs DA) | state_commitment_update | 12,318,520 | 12,225,753 (-0.75%) | 11,193,560 | 11,100,793 (-0.83%) | 70,310 | 70,310 (+0.00%) | 0 | 0 (+0.00%) | 0 | 0 (+0.00%) |
| block_19299001 (keccak DA) | state_commitment_update | 12,318,508 | 12,223,400 (-0.77%) | 11,193,548 | 11,098,440 (-0.85%) | 70,310 | 70,310 (+0.00%) | 0 | 0 (+0.00%) | 0 | 0 (+0.00%) |
| block_22244135 (blobs DA) | state_commitment_update | 7,093,791 | 7,036,782 (-0.80%) | 6,440,671 | 6,383,662 (-0.89%) | 40,820 | 40,820 (+0.00%) | 0 | 0 (+0.00%) | 0 | 0 (+0.00%) |
| block_22244135 (keccak DA) | state_commitment_update | 7,092,435 | 7,036,839 (-0.78%) | 6,439,315 | 6,383,719 (-0.86%) | 40,820 | 40,820 (+0.00%) | 0 | 0 (+0.00%) | 0 | 0 (+0.00%) |
| block_19299001 (keccak DA) | system_init | 45,058 | 45,058 (+0.00%) | 45,058 | 45,058 (+0.00%) | 0 | 0 (+0.00%) | 0 | 0 (+0.00%) | 0 | 0 (+0.00%) |
| block_22244135 (keccak DA) | system_init | 45,058 | 45,058 (+0.00%) | 45,058 | 45,058 (+0.00%) | 0 | 0 (+0.00%) | 0 | 0 (+0.00%) | 0 | 0 (+0.00%) |

Precompiles test-crate bench (synthetic workload, all labels)

| Benchmark | Symbol | Base Eff | Head Eff (%) | Base Raw | Head Raw (%) | Base Blake | Head Blake (%) | Base Bigint | Head Bigint (%) | Base Keccak | Head Keccak (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| precompiles | bn254_ecadd | 53,315 | 53,315 (+0.00%) | 47,863 | 47,863 (+0.00%) | 0 | 0 (+0.00%) | 1,363 | 1,363 (+0.00%) | 0 | 0 (+0.00%) |
| precompiles | bn254_ecmul | 731,892 | 731,892 (+0.00%) | 567,704 | 567,704 (+0.00%) | 0 | 0 (+0.00%) | 41,047 | 41,047 (+0.00%) | 0 | 0 (+0.00%) |
| precompiles | bn254_pairing | 71,468,694 | 71,468,694 (+0.00%) | 56,940,550 | 56,940,550 (+0.00%) | 0 | 0 (+0.00%) | 3,632,036 | 3,632,036 (+0.00%) | 0 | 0 (+0.00%) |
| precompiles | da_commitment | 16,706 | 16,685 (-0.13%) | 13,630 | 13,609 (-0.15%) | 30 | 30 (+0.00%) | 0 | 0 (+0.00%) | 649 | 649 (+0.00%) |
| precompiles | ecrecover | 370,599 | 369,233 (-0.37%) | 241,811 | 241,033 (-0.32%) | 0 | 0 (+0.00%) | 31,548 | 31,401 (-0.47%) | 649 | 649 (+0.00%) |
| precompiles | id | 925 | 925 (+0.00%) | 925 | 925 (+0.00%) | 0 | 0 (+0.00%) | 0 | 0 (+0.00%) | 0 | 0 (+0.00%) |
| precompiles | keccak | 31,673 | 31,673 (+0.00%) | 10,901 | 10,901 (+0.00%) | 0 | 0 (+0.00%) | 1 | 1 (+0.00%) | 5,192 | 5,192 (+0.00%) |
| precompiles | modexp | 31,888,536 | 31,888,577 (+0.00%) | 21,230,716 | 21,230,757 (+0.00%) | 0 | 0 (+0.00%) | 2,664,455 | 2,664,455 (+0.00%) | 0 | 0 (+0.00%) |
| precompiles | p256_verify | 747,278 | 747,278 (+0.00%) | 468,586 | 468,586 (+0.00%) | 0 | 0 (+0.00%) | 69,673 | 69,673 (+0.00%) | 0 | 0 (+0.00%) |
| precompiles | process_block | 144,538,154 | 144,547,127 (+0.01%) | 114,942,250 | 114,937,299 (-0.00%) | 5,340 | 5,370 (+0.56%) | 7,325,696 | 7,329,057 (+0.05%) | 51,920 | 51,920 (+0.00%) |
| precompiles | process_transaction | 72,052,750 | 72,055,387 (+0.00%) | 57,303,718 | 57,305,331 (+0.00%) | 160 | 160 (+0.00%) | 3,664,552 | 3,664,808 (+0.01%) | 22,066 | 22,066 (+0.00%) |
| precompiles | ripemd | 8,010 | 8,010 (+0.00%) | 8,010 | 8,010 (+0.00%) | 0 | 0 (+0.00%) | 0 | 0 (+0.00%) | 0 | 0 (+0.00%) |
| precompiles | run_tx_loop | 144,062,968 | 144,062,049 (-0.00%) | 114,583,372 | 114,569,009 (-0.01%) | 180 | 180 (+0.00%) | 7,325,696 | 7,329,057 (+0.05%) | 43,483 | 43,483 (+0.00%) |
| precompiles | sha256 | 13,315 | 13,315 (+0.00%) | 13,315 | 13,315 (+0.00%) | 0 | 0 (+0.00%) | 0 | 0 (+0.00%) | 0 | 0 (+0.00%) |
| precompiles | state_commitment_update | 183,072 | 182,339 (-0.40%) | 143,392 | 142,659 (-0.51%) | 2,480 | 2,480 (+0.00%) | 0 | 0 (+0.00%) | 0 | 0 (+0.00%) |
| precompiles | system_init | 49,767 | 49,768 (+0.00%) | 49,767 | 49,768 (+0.00%) | 0 | 0 (+0.00%) | 0 | 0 (+0.00%) | 0 | 0 (+0.00%) |

Per-opcode

Per-opcode cycle diff

| Opcode | Count | Med Cycles eff (%) | Total Cycles eff (%) | Med Cyc/Gas eff (%) | Worst Cyc/Gas eff (%) |
|---|---|---|---|---|---|
| SLOAD | 3237 | 1,134 (-2.8%) | 5,051,727 (+1.8%) | 5.8 (-0.3%) | 14.5 (-0.2%) |
| SSTORE | 950 | 2,178 (-0.9%) | 2,124,054 (-2.0%) | 0.7 (-0.6%) | 33.6 (-5.9%) |
| EXTCODESIZE | 291 | 456 (-0.2%) | 509,527 (+2.0%) | 3.5 (-0.3%) | 6.3 (-0.2%) |
| LOG3 | 341 | 1,041 (+0.8%) | 362,410 (+0.8%) | 0.6 | 0.8 (+0.1%) |
| LOG2 | 49 | 921 (-0.9%) | 47,070 (+0.1%) | 0.6 (-1.0%) | 0.8 |
| LOG1 | 46 | 851 (+0.6%) | 41,073 (+0.4%) | 0.6 (+1.1%) | 0.9 (-4.2%) |
| LOG4 | 28 | 1,229 (+2.2%) | 38,317 (+1.4%) | 0.5 (+2.4%) | 0.5 (+2.0%) |
| SELFBALANCE | 58 | 275 (-0.4%) | 17,556 (-0.3%) | 55.0 (-0.4%) | 92.0 (-0.2%) |
| TLOAD | 3 | 399 (+0.8%) | 2,197 (+18.0%) | 4.0 (+0.8%) | 14.0 (+30.7%) |
| EXTCODEHASH | 3 | 581 (-0.2%) | 1,743 (-0.2%) | 5.8 (-0.2%) | 5.8 (-0.2%) |
| TSTORE | 2 | 735 (-15.4%) | 1,470 (-15.4%) | 7.3 (-15.4%) | 7.7 (-14.8%) |

Per-precompile

Per-precompile per-execution ratios (head)
cycles = effective (raw + Blake×16 + BigInt×4 + Keccak×4)
precompile                count    med c/g    p95 c/g    p99 c/g    max c/g    med n/g    p95 n/g    p99 n/g    max n/g
------------------------------------------------------------------------------------------------------------------------
modexp                      105       71.1      713.5     2847.2     2847.7      300.0     1200.3     4814.0     4814.0
point_eval                    2     1025.1     1025.1     1025.1     1025.1     1262.1     1262.1     1262.1     1262.1
blake2f                       2      803.7      803.7      803.7      803.7        0.0        0.0        0.0        0.0
ecadd                        57      335.9      358.4      360.0      360.0      350.7      350.7      350.7      350.7
bls12_pairing_check           2      217.2      217.2      217.2      217.2        0.0        0.0        0.0        0.0
ecpairing                    31      168.4      185.6      185.6      185.6      398.2      428.6      428.6      428.6
keccak                     2497      111.7      126.6      139.3      150.6      478.8      558.6      626.8      684.2
ecmul                        37      119.0      124.1      126.5      126.5      127.3      127.3      127.3      127.3
ecrecover                    59      119.1      122.3      123.5      123.5      174.0      174.0      174.0      174.0
sha256                        4       68.4      123.3      123.3      123.3       80.6      131.5      131.5      131.5
p256_verify                  16      107.3      108.3      108.3      108.3      113.6      113.6      113.6      113.6
bls12_g1msm                   2      100.3      100.3      100.3      100.3        0.0        0.0        0.0        0.0
bls12_g2msm                   2       88.1       88.1       88.1       88.1        0.0        0.0        0.0        0.0
bls12_g2add                   2       45.0       45.0       45.0       45.0        0.0        0.0        0.0        0.0
identity                      5       22.7       34.3       34.3       34.3       31.4       48.1       48.1       48.1
bls12_g1add                   2       28.1       28.1       28.1       28.1        0.0        0.0        0.0        0.0
ripemd160                     4        4.4        7.4        7.4        7.4        8.1       13.1       13.1       13.1
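
The effective-cycle formula quoted in the per-precompile header (raw + Blake×16 + BigInt×4 + Keccak×4) can be checked against the block-level numbers. The helper names below are hypothetical; the inputs are the head and base figures of the `block_19299001 (keccak DA)` `process_block` row.

```rust
/// Effective cycles: raw cycles plus weighted delegation counts.
fn effective_cycles(raw: u64, blake: u64, bigint: u64, keccak: u64) -> u64 {
    raw + blake * 16 + bigint * 4 + keccak * 4
}

/// Relative change from `base` to `head`, in percent.
fn delta_percent(base: u64, head: u64) -> f64 {
    (head as f64 - base as f64) / base as f64 * 100.0
}
```

With raw = 152,306,927, Blake = 410,630, Bigint = 7,681,862, and Keccak = 3,193,080, the formula gives 202,376,775, which is the reported Head Eff for that row; against the base of 204,573,864 the change is about −1.07 %.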
