perf(history_map): arena-allocate ElementWithHistory#669
Conversation
## What ❔ Replaces `Bytes32`'s `Ord` impl on **the RISC-V proving target only** with a word-by-word equality scan over the underlying `inner: [usize; BYTES32_USIZE_SIZE]` (N=4 on 64-bit, 8 on RISC-V32): - Equal-prefix iterations stay in the fast path: load word, compare word, branch. - On the first differing word, resolve byte-lex order by walking the bytes of just that word — cheaper than `swap_bytes()` on RV32 without the Zbb extension (which this target does not enable). - `cmp` / `partial_cmp` are `#[inline]` since they sit on the BTreeMap descent hot path. On non-RV32 targets (forward / sequencer host) the impl falls back to the original `as_u8_array_ref().cmp(...)`. libc's `memcmp` is already SIMD-vectorized (SSE2/AVX/NEON), so the byte path beats a pure-Rust word loop on the host. The chunked helper (`cmp_word_chunked`) is still compiled on every target so the equivalence tests can validate it on the host. Affected: - `zk_ee/src/utils/bytes32.rs` ## Why ❔ From the flamegraph analysis on `draft-0.4.0`: - `impls::compare_bytes` is the **second-highest self-cost function in the binary at 6.47 %**. On no-std RISC-V the `<[u8]>::cmp` path falls back to `compiler-builtins`' generic `memcmp` (`compiler-builtins/src/mem/impls.rs:388`) — a plain byte loop with no chunking. - Almost all of that self-cost is reached through `Bytes32::cmp` (directly, or via `WarmStorageKey::cmp` → `Bytes32::cmp`) sitting inside `NodeRef::find_key_index` in HistoryMap / preimage / FlatStorageCommitment lookups. Comparing word-aligned 32-byte values byte-by-byte is wasteful when the struct is already a `[usize; N]` with `align(8)`. Forward mode doesn't need the rewrite because libc memcmp already handles it efficiently. ## Benchmark results Branch vs `origin/draft-0.4.0` (commit `59bdedfa`): | Block | Base eff | Head eff | Δ eff | Δ raw | |---|---|---|---|---| | `19299001` | 208,601,280 | 204,648,699 | **−1.89 %** | **−2.49 %** | | `22244135` | 134,741,585 | 132,504,954 | **−1.66 %** | **−2.10 %** | Blake / Bigint / Keccak delegation counts unchanged in both runs — pure raw-cycle reduction. ### Variant comparison (RV32 only — host path is unchanged) | Variant | block 19299001 Δ eff | block 22244135 Δ eff | |---|---|---| | word-cmp + `to_be()` bswap on mismatch | −1.56 % | −1.39 % | | **word-cmp + `#[inline]` + byte-fallback (this PR)** | **−1.89 %** | **−1.66 %** | | word-cmp + 2-word XOR-OR fusion | −1.68 % | −1.45 % | The 2-word XOR-OR variant regresses vs the simple loop — the extra arithmetic per executed iteration doesn't pay for the halved branch count on RV32. Picked the byte-fallback variant. ## Is this a breaking change? - [ ] Yes - [x] No Public API of `Bytes32` is unchanged. `Ord` / `PartialOrd` impls produce identical orderings to the previous implementation on every target (verified by `cmp_tests::cmp_matches_byte_lex_on_pseudorandom_pairs`, which invokes `cmp_word_chunked` directly so the helper is exercised on the host too). ## Stackability with #668 / #669 This PR targets the leaf `compare_bytes` cost; #668 / #669 target the BTreeMap lookups themselves on the pending-list paths. They hit different code regions and are expected to compose (overlap only where BTreeMap node descent invokes `Bytes32::cmp` — and even there the per-compare cost goes down). ## Checklist - [x] PR title corresponds to the body of PR. - [x] Tests for the changes have been added / updated. (Equivalence tests added: `cmp_matches_byte_lex_on_handcrafted_pairs`, `cmp_matches_byte_lex_on_pseudorandom_pairs`. Both invoke `cmp_word_chunked` directly so the chunked path is exercised on the host.) - [x] Documentation comments have been added / updated. - [x] Code has been formatted. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7e50da0 to
f237dc9
Compare
There was a problem hiding this comment.
Pull request overview
This PR changes HistoryMap’s internal storage so ElementWithHistory values are arena-allocated in fixed-size ListVec pages, and both the BTreeMap and the pending-updates list store NonNull pointers into that arena. The goal is to reduce per-element allocation overhead and avoid key cloning / repeated BTreeMap descents on rollback/commit/pending-iteration paths.
Changes:
- Store
NonNull<ElementWithHistory<...>>in theBTreeMapand inpending_updated_elements, replacing key-based pending bookkeeping. - Embed
KinsideElementWithHistoryso callbacks/iterators can surface&Kwithout looking back into theBTreeMap. - Introduce
ElementWithHistoryArenabacked byListVecpages to amortize allocations across elements.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
zk_ee/src/common_structs/history_map/mod.rs |
Reworks HistoryMap to use stable arena pointers for map values and pending updates; updates rollback/commit/iter paths accordingly. |
zk_ee/src/common_structs/history_map/element_with_history.rs |
Adds embedded key storage by introducing a new K type parameter and storing key: K in the element. |
zk_ee/src/common_structs/history_map/element_with_history_arena.rs |
New arena wrapper around ListVec to allocate ElementWithHistory in fixed-size pages and return stable pointers. |
Comments suppressed due to low confidence (1)
zk_ee/src/common_structs/history_map/element_with_history.rs:43
ElementWithHistoryis apubtype in apubmodule, and this change adds a new generic parameterKplus a new requiredkeyargument innew(). That is a breaking public API change for any downstream code that usesElementWithHistorydirectly, which conflicts with the PR’s “not breaking” claim. IfElementWithHistoryis meant to be internal, consider making the type/modulepub(crate); otherwise consider preserving backward compatibility (e.g., a type alias or an alternate constructor).
/// The history linked list. Always has at least one item with the snapshot id of 0.
///
/// `key` is embedded so that the pending-updates list can store a stable pointer
/// to an `ElementWithHistory` and still surface the key on iteration without
/// going back through the BTreeMap.
pub struct ElementWithHistory<K, V, A: Allocator + Clone, EP = ()> {
/// Key owned by this element (separate from the BTreeMap key copy).
pub key: K,
/// Additional properties associated with the element globally.
/// These properties persist across rollbacks/commits and don't participate in snapshots
pub element_properties: EP,
/// Initial record (before history started)
pub initial: HistoryRecordLink<V>,
first: HistoryRecordLink<V>,
/// Current history record
pub head: HistoryRecordLink<V>,
/// Record that has been committed, or initial if not commit has been
/// performed.
pub committed: HistoryRecordLink<V>,
marker: PhantomData<A>,
}
impl<K, V, A: Allocator + Clone, KP> ElementWithHistory<K, V, A, KP> {
#[inline(always)]
pub fn new(
key: K,
key_properties: KP,
initial_value: V,
records_memory_pool: &mut ElementPool<V, A>,
) -> Self {
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| use alloc::collections::btree_map::Entry; | ||
| use alloc::collections::BTreeMap; | ||
| use core::{alloc::Allocator, fmt::Debug, ops::Bound}; | ||
| use core::{alloc::Allocator, fmt::Debug, ops::Bound, ptr::NonNull}; |
There was a problem hiding this comment.
False positive — fmt::Debug is still referenced via the #[derive(Debug)] on NopSnapshotId and CacheSnapshotId further down in the same file. cargo clippy --workspace -- -D warnings passes cleanly against this exact commit; no action needed.
| // Releases the arena pages along with the `ElementWithHistory` values | ||
| // (and their owned keys). Pointers stored in `btree` or pending list | ||
| // would be dangling after this — but both have just been cleared. | ||
| self.elements_arena.clear(); | ||
| self.state.next_snapshot_id = CacheSnapshotId(1); | ||
| self.state.frozen_snapshot_id = CacheSnapshotId(0); | ||
| self.state.pending_updated_elements = StackLinkedList::empty(self.state.alloc.clone()); |
There was a problem hiding this comment.
Fixed in 507ef38. HistoryMap::clear now drops the BTreeMap and the pending-updates list before releasing the arena pages, so the "every pointer stored in either container is valid" invariant holds at every observable point.
| .top_mut() | ||
| .expect("ListVec::top_mut must yield the slot we just pushed"); |
There was a problem hiding this comment.
Good catch — fixed in 507ef38. Added ListVec::push_returning_ref that uses LinkedList::back_mut (O(1)) instead of iter_mut().last() (O(pages)). allocate() now hits the back node directly without walking the chain twice.
| // LinkedList::clear drops the ArrayVec nodes, which in turn drop the | ||
| // contained ElementWithHistory values (and their owned keys). | ||
| self.buffer.0.clear(); |
There was a problem hiding this comment.
Fixed in 507ef38 — clear() now calls self.buffer.clear() via the Stack trait instead of reaching into ListVec's tuple field.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
zk_ee/src/common_structs/history_map/mod.rs:146
get_mut()/get_or_insert()still use a shared'slifetime for both&mut selfand thekeyreference, but the returnedHistoryMapItemRefMutno longer stores&K. Consider changing these to acceptkey: &K(no'sparameter) so the mutable borrow of the map determines the output lifetime, rather than also requiring the caller'skeyborrow to live equally long.
/// Get history of an element by key, mutable
pub fn get_mut<'s>(&'s mut self, key: &'s K) -> Option<HistoryMapItemRefMut<'s, K, V, A, KP>> {
let ptr = *self.btree.get(key)?;
Some(HistoryMapItemRefMut {
// Safety: pointer is valid for the lifetime of `&mut self`. We borrow
// the arena element via the raw pointer rather than re-entering the
// BTreeMap so that `cache_state` and `records_memory_pool` can be
// borrowed mutably alongside.
history: unsafe { &mut *ptr.as_ptr() },
cache_state: &mut self.state,
records_memory_pool: &mut self.records_memory_pool,
})
}
/// Get history of an element by key or use callback to insert initial value
pub fn get_or_insert<'s, E>(
&'s mut self,
key: &'s K,
spawn_v: impl FnOnce() -> Result<(V, KP), E>,
) -> Result<HistoryMapItemRefMut<'s, K, V, A, KP>, E> {
let ptr = match self.btree.entry(key.clone()) {
Entry::Occupied(o) => *o.into_mut(),
| /// Get history of an element by key | ||
| pub fn get<'s>(&'s self, key: &'s K) -> Option<HistoryMapItemRef<'s, K, V, A, KP>> { | ||
| self.btree | ||
| .get(key) | ||
| .map(|ec| HistoryMapItemRef { key, history: ec }) | ||
| self.btree.get(key).map(|ptr| HistoryMapItemRef { | ||
| // Safety: pointer is valid for the lifetime of `&self`. | ||
| history: unsafe { ptr.as_ref() }, | ||
| }) | ||
| } |
There was a problem hiding this comment.
Good observation — fixed in 2d385c1. get / get_mut / get_or_insert no longer tie the input &K reference's lifetime to &self. Now HistoryMapItemRef<'_> / Mut<'_> borrow only via &self / &mut self, so callers can drop the key reference before using the returned ref.
| /// The history linked list. Always has at least one item with the snapshot id of 0. | ||
| pub struct ElementWithHistory<V, A: Allocator + Clone, EP = ()> { | ||
| /// | ||
| /// `key` is embedded so that the pending-updates list can store a stable pointer | ||
| /// to an `ElementWithHistory` and still surface the key on iteration without | ||
| /// going back through the BTreeMap. | ||
| pub struct ElementWithHistory<K, V, A: Allocator + Clone, EP = ()> { | ||
| /// Key owned by this element (separate from the BTreeMap key copy). | ||
| pub key: K, | ||
| /// Additional properties associated with the element globally. | ||
| /// These properties persist across rollbacks/commits and don't participate in snapshots | ||
| pub element_properties: EP, |
There was a problem hiding this comment.
Agreed — verified neither ElementWithHistory nor HistoryRecord is referenced outside the history_map module, so the pub mod exposure was unintentional. Fixed in 2d385c1: element_with_history is now pub(crate), eliminating the unintended public surface.
Stores ElementWithHistory in a fixed-page-size ListVec arena (mirroring ElementPool for HistoryRecord). BTreeMap holds NonNull pointers into the arena; pending-updates list stores those same pointers instead of cloned keys. Bypasses BTreeMap descents on rollback/commit/iter-pending paths and amortizes ElementWithHistory allocations over arena pages. K is embedded inside ElementWithHistory so iterators/callbacks still surface the key without a BTreeMap lookup. Experimental — Option B sibling of vv/history-map-box (Box variant).
`cargo clippy --workspace -- -D warnings` flagged the `btree.iter().map(|(_k, ptr)| ...)` pattern in HistoryMap::iter as `clippy::iter_kv_map`. Replace with `btree.values().map(|ptr| ...)`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…abstraction Three review-driven fixes to the arena-based HistoryMap: * `ElementWithHistoryArena::allocate` previously did `Stack::push` followed by `Stack::top_mut`, each of which walks the backing `LinkedList` via `iter_mut().last()` (O(pages)). Add `ListVec::push_returning_ref` that uses `LinkedList::back_mut` (O(1)) once instead — turns per-insert work from O(pages) into O(1). * `HistoryMap::clear` now drops the BTreeMap and the pending-updates list *before* releasing the arena pages, so the "every pointer stored in either container is valid" invariant holds at every observable point. Equivalent under the current `&mut self` borrow but defends against any future panic path opening up between the two drops. * `ElementWithHistoryArena::clear` calls `Stack::clear` instead of reaching into `ListVec`'s tuple-struct field (`buffer.0.clear()`), avoiding a coupling to ListVec's internal representation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nsert lifetimes
Two more review-driven fixes:
* `element_with_history` module is now `pub(crate)` since `ElementWithHistory`
and `HistoryRecord` have no external callers; this PR's addition of the
`K` type parameter would otherwise be a public breaking-API change.
* `HistoryMap::{get, get_mut, get_or_insert}` no longer tie the input `key`
reference's lifetime to `&self`. `HistoryMapItemRef`/`Mut` only need to
borrow the arena element (now owned via `&self` / `&mut self`), so the
signature relaxes to `key: &K`, letting callers drop the key borrow
before using the returned ref.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2d385c1 to
3042aa2
Compare
Block-level effective cyclesAverage across all block fixtures (
Per-block effective cycles
Block-level sub-phases
Precompiles test-crate bench (synthetic workload, all labels)
FRI precompile bench (FriProofTx + sidecar + contract call)
Per-opcodePer-opcode cycle diff
Per-precompilePer-precompile per-execution ratios (head) |
What ❔
Sibling experiment to #668 (Box variant). Allocates
ElementWithHistoryfrom a fixed-pageListVecarena (element_with_history_arena.rs) instead of oneBox::new_inper element. BTreeMap storesNonNullpointers into the arena; pending-updates list stores those same pointers. Same K-embedding insideElementWithHistoryas #668 so callbacks/iterators surface&Kwithout a BTreeMap lookup.Affected:
zk_ee/src/common_structs/history_map/mod.rszk_ee/src/common_structs/history_map/element_with_history.rszk_ee/src/common_structs/history_map/element_with_history_arena.rs(new)Why ❔
Same goal as #668 — bypass BTreeMap descents on the bookkeeping paths and skip K::clone on every
update(). This variant additionally amortizes the per-element allocation over arena pages, matching the existingElementPoolstyle forHistoryRecord.Benchmark results
Same
origin/draft-0.4.0baseline as #668, same block fixtures.Arena vs baseline:
1929900122244135Head-to-head vs the Box variant (#668):
1929900122244135Delegation counts identical across all four runs; pure raw-cycle savings.
So Option B adds ~+0.3 % on top of A. The bulk of the win (BTreeMap-descent and K::clone elimination) is already captured by A; B's additional savings come from collapsing
Box::new_incalls into one allocation per arena page.Forward-mode expectations
Forward execution has real CPU caches and TLB. The arena layout packs ~32
ElementWithHistoryper page vs scatteredBoxheap slots, so B should be modestly more cache- and TLB-friendly there (HistoryMap is rarely the forward-mode bottleneck though). Not measured — the project doesn't currently surface a forward-mode wall-clock benchmark.Is this a breaking change?
Public API of
HistoryMap,HistoryMapItemRef,HistoryMapItemRefMutis preserved.Relationship to #668
This PR and #668 implement the same idea (stable-address
ElementWithHistory+ pointer-only pending list) with different storage strategies. Only one should land. Numbers above are the input for that decision.Checklist
miri_*and rig-style HistoryMap tests cover the arena paths; internalElementWithHistorytests updated for the newKtype parameter.)