feat(evm_interpreter): specialize PUSH3..PUSH8 via a u64-sized helper #648
Open
0xVolosnikov wants to merge 1 commit into
ly0va
approved these changes
May 15, 2026
Pull request overview
This PR optimizes EVM PUSH3 through PUSH8 handling by routing them through a new u64-based helper instead of the generic U256 path, reducing work in the interpreter hot path while preserving existing PUSH semantics.
Changes:
- Added `push_small<const N>()` for `PUSH3..=PUSH8`, including truncated-bytecode zero-padding behavior.
- Updated opcode dispatch to use the new helper for `PUSH3` through `PUSH8`.
- Left `PUSH1`, `PUSH2`, and `PUSH9..PUSH32` on their existing specialized/generic paths.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| evm_interpreter/src/interpreter.rs | Routes PUSH3..PUSH8 opcodes to the new small push helper. |
| evm_interpreter/src/instructions/stack.rs | Adds the push_small helper that assembles small PUSH payloads into a u64. |
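The routing summarized in the table can be pictured with a small classifier. This is only a sketch: `PushRoute` and `route_push` are hypothetical names, and the size is carried as a runtime value here, whereas the PR dispatches to const-generic calls such as `self.push_small::<N>()`. The opcode values are the standard EVM encodings (PUSH1 = 0x60 through PUSH32 = 0x7f).

```rust
/// Hypothetical label for the handler a PUSH opcode is routed to.
#[derive(Debug, PartialEq, Eq)]
enum PushRoute {
    Push1,          // bespoke PUSH1 specialization
    Push2,          // bespoke PUSH2 specialization
    Small(usize),   // PUSH3..=PUSH8: u64-sized helper
    Generic(usize), // PUSH9..=PUSH32: generic U256 path
    Other,          // anything that is not PUSH1..=PUSH32
}

/// Classify an opcode byte by the push path it would take in this sketch.
fn route_push(opcode: u8) -> PushRoute {
    match opcode {
        0x60 => PushRoute::Push1,
        0x61 => PushRoute::Push2,
        // PUSHn is encoded as 0x5f + n, so n = opcode - 0x5f.
        0x62..=0x67 => PushRoute::Small(usize::from(opcode - 0x5f)),
        0x68..=0x7f => PushRoute::Generic(usize::from(opcode - 0x5f)),
        _ => PushRoute::Other,
    }
}
```

For instance, `route_push(0x63)` (PUSH4) yields `PushRoute::Small(4)`, while `route_push(0x68)` (PUSH9) stays on `PushRoute::Generic(9)`.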
PUSH3..PUSH8 all fit in a u64 but were going through the generic push<const N: usize> path, which copies bytes into a U256, byte-reverses the full 32-byte value, and shifts off the unused high bytes. That work is unnecessary when the payload is <= 8 bytes.
Add a single `push_small<const N: usize>` helper that:
- in the common (in-bounds) case, copies the N payload bytes into the high end of an [u8; 8] buffer and decodes via `u64::from_be_bytes`;
- on the truncated-bytecode tail, falls back to a per-byte zero-padded loop.
Either path produces the same u64 the generic path would yield, then pushes it directly via EvmStack::push_u64, skipping the bytereverse + shift.
Dispatch in interpreter.rs swaps `self.push::<N>()` for `self.push_small::<N>()` for N in 3..=8. PUSH1 and PUSH2 keep their bespoke specializations.
Benchmark (bench_scripts/bench.sh compare against PUSH2-specialized baseline):
- block_19299001 process_block: -0.31% effective (-652K cycles)
- block_22244135 process_block: -0.40% effective (-545K cycles)
- PUSH3 median: 144 -> 73 cycles (-49.7%)
- PUSH4 median: 142 -> 73 cycles (-48.2%), total -545K cycles
- PUSH5 median: 147 -> 77 cycles (-47.6%)
- PUSH6 median: 152 -> 82 cycles (-45.7%)
- PUSH7 median: 175 -> 85 cycles (-51.1%)
- PUSH8 median: 162 -> 83 cycles (-48.4%)
- Delegations (Blake/BigInt/Keccak): unchanged
Some other opcodes show small positive regressions (PUSH1 +1.6%, PUSH2 +1.5%, MLOAD +2.4%, CALLDATALOAD +4.4%, EQ +4.6%) consistent with the I-cache sensitivity already known on this interpreter loop. The net block-level effective-cycle savings dominate.
Intermediate stage 1 (PUSH3..PUSH5 only) showed -0.28%/-0.40%; stage 2 (PUSH3..PUSH8) improves block_19299001 to -0.31% without further regression elsewhere.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Block-level effective cycles
Average across all block fixtures (
Per-block effective cycles
Block-level sub-phases
Precompiles test-crate bench (synthetic workload, all labels)
FRI precompile bench (FriProofTx + sidecar + contract call)
Per-opcode
Per-opcode cycle diff
Per-precompile
Per-precompile per-execution ratios (head)
What ❔
Add a single `push_small<const N: usize>()` helper that specializes PUSH3..PUSH8 in the EVM interpreter, in the same spirit as the existing PUSH1 and PUSH2 specializations (PR #646).
All of PUSH3..PUSH8 fit in a `u64` (≤ 8 bytes of payload), but were routed through the generic `push<const N: usize>()` path, which copies bytecode bytes into a `U256`, byte-reverses the full 32-byte value, and right-shifts off the unused high bytes. That work is unnecessary for small N.
The new helper:
- In the common (in-bounds) case, copies the N payload bytes into the high end of a `[u8; 8]` buffer, then decodes via `u64::from_be_bytes`. One bounds check, one chunk copy, one byte-swap.
- On the truncated-bytecode tail, falls back to a per-byte zero-padded loop.
Both paths produce the same u64 the generic path would yield, then push it directly via `EvmStack::push_u64`.
Dispatch in `interpreter.rs` swaps `self.push::<N>()` for `self.push_small::<N>()` for N in 3..=8. PUSH1 and PUSH2 keep their bespoke specializations (each is simple enough that a helper would obscure rather than simplify).
Why ❔
Stacks on top of PR #646 (PUSH2 specialization). With only PUSH2 specialized, PUSH3..PUSH8 collectively still account for a meaningful slice of opcode cycles in the benchmark blocks. Bringing them to PUSH1/PUSH2 parity (~67-85 cycles each instead of ~144-175) saves cycles on every execution of these opcodes and continues the cleanup of the same hot path.
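The two decode paths described under "What" can be sketched as a standalone function. This is a minimal illustration, not the code in `stack.rs`: `decode_push_small`, the `bytecode` slice, and the `pc` index of the first payload byte are assumed names, and the final `EvmStack::push_u64` call is left out so the block compiles on its own.

```rust
/// Decode the N-byte immediate of a small PUSH into a u64.
/// `bytecode` and `pc` (index of the first payload byte) are hypothetical
/// stand-ins for the interpreter's real state.
fn decode_push_small<const N: usize>(bytecode: &[u8], pc: usize) -> u64 {
    assert!(N >= 1 && N <= 8, "payload must fit in a u64");
    // Single bounds check covering the whole payload.
    let in_bounds = pc.checked_add(N).and_then(|end| bytecode.get(pc..end));
    match in_bounds {
        Some(payload) => {
            // Fast path: copy the payload into the high-index end of an
            // 8-byte buffer so the leading bytes stay zero, then decode
            // the buffer as a big-endian integer.
            let mut buf = [0u8; 8];
            buf[8 - N..].copy_from_slice(payload);
            u64::from_be_bytes(buf)
        }
        None => {
            // Truncated bytecode: bytes past the end are read as zero.
            let mut value = 0u64;
            for i in 0..N {
                let byte = pc
                    .checked_add(i)
                    .and_then(|idx| bytecode.get(idx))
                    .copied()
                    .unwrap_or(0);
                value = (value << 8) | u64::from(byte);
            }
            value
        }
    }
}
```

With a full payload, `decode_push_small::<3>(&[0x01, 0x02, 0x03], 0)` gives `0x010203`; with only two of the three bytes present, `decode_push_small::<3>(&[0xAA, 0xBB], 0)` gives `0xAABB00`, because the missing trailing byte is treated as zero.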
Benchmark
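`bench_scripts/bench.sh compare`, baseline = `vv-push2-specialization` HEAD (i.e. PUSH2 already specialized):
| Block | process_block effective cycles |
|---|---|
| block_19299001 | -0.31% (-652K cycles) |
| block_22244135 | -0.40% (-545K cycles) |
Per-opcode (block_19299001):
| Opcode | Median cycles (baseline -> this PR) | Change |
|---|---|---|
| PUSH3 | 144 -> 73 | -49.7% |
| PUSH4 | 142 -> 73 | -48.2% (total -545K cycles) |
| PUSH5 | 147 -> 77 | -47.6% |
| PUSH6 | 152 -> 82 | -45.7% |
| PUSH7 | 175 -> 85 | -51.1% |
| PUSH8 | 162 -> 83 | -48.4% |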
Delegations (Blake / BigInt / Keccak): unchanged.
Incremental verification
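The work was benchmarked in two stages on the same baseline:
| Stage | Opcodes specialized | block_19299001 | block_22244135 |
|---|---|---|---|
| Stage 1 | PUSH3..PUSH5 | -0.28% | -0.40% |
| Stage 2 | PUSH3..PUSH8 | -0.31% | -0.40% |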
Stage 2 adds PUSH6..PUSH8 on top of Stage 1 and improves block_19299001 by an additional ~70K cycles without making block_22244135 worse — confirming that extending the specialization through N=8 is still net positive despite the larger binary.
Notes on cross-opcode effects
Some other opcodes show small positive regressions (PUSH1 +1.6%, PUSH2 +1.5%, MLOAD +2.4%, CALLDATALOAD +4.4%, EQ +4.6%). These are consistent with the I-cache sensitivity already known on this interpreter loop — adding code to the dispatch hot path shifts neighbor instructions. The block-level effective-cycle savings dominate, but reviewers should be aware.
Is this a breaking change?
No protocol-visible behavior change. Gas and native cost charging are unchanged. The pushed value is bit-for-bit identical to the generic path on all inputs, including the truncated-bytecode edge case.
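One way to picture the bit-for-bit claim is to model both paths on plain byte arrays and compare them. The sketch below is an assumption-laden model, not the crate's code: `generic_path_model` follows the copy / byte-reverse / right-shift steps as this description words them, using a little-endian `[u8; 32]` in place of `U256`, and all function names are hypothetical.

```rust
/// Model of the generic path as described: copy the payload into a 32-byte
/// word, byte-reverse the whole word, then shift right by the unused bytes.
/// Returns the result as little-endian bytes. Hypothetical helper.
fn generic_path_model<const N: usize>(bytecode: &[u8], pc: usize) -> [u8; 32] {
    assert!(N >= 1 && N <= 32);
    let mut word = [0u8; 32];
    for i in 0..N {
        // Bytes past the end of the bytecode stay zero.
        if let Some(&b) = pc.checked_add(i).and_then(|idx| bytecode.get(idx)) {
            word[i] = b;
        }
    }
    word.reverse(); // full 32-byte byte-reverse
    let shift = 32 - N; // number of unused bytes to shift off
    let mut out = [0u8; 32];
    out[..N].copy_from_slice(&word[shift..]); // right shift by `shift` bytes
    out
}

/// Model of the u64 path using the per-byte zero-padded loop.
fn small_path_model<const N: usize>(bytecode: &[u8], pc: usize) -> u64 {
    assert!(N >= 1 && N <= 8);
    let mut value = 0u64;
    for i in 0..N {
        let byte = pc
            .checked_add(i)
            .and_then(|idx| bytecode.get(idx))
            .copied()
            .unwrap_or(0);
        value = (value << 8) | u64::from(byte);
    }
    value
}

/// True when the wide result equals the u64 result zero-extended to 256 bits.
fn paths_agree<const N: usize>(bytecode: &[u8], pc: usize) -> bool {
    let wide = generic_path_model::<N>(bytecode, pc);
    let mut low = [0u8; 8];
    low.copy_from_slice(&wide[..8]);
    wide[8..].iter().all(|&b| b == 0)
        && u64::from_le_bytes(low) == small_path_model::<N>(bytecode, pc)
}
```

Under this model, `paths_agree::<3>(&[0xAA, 0xBB], 0)` holds for the truncated case as well as for fully in-bounds payloads; a check of this shape could be run over every N in 3..=8 and every truncation length.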
Checklist
- `cargo test -p evm_interpreter --features testing` (13/13), `tests/instances/evm` (14/14).
Base branch note
This PR targets `vv-push2-specialization` (PR #646). When that lands into `custom-u256`, this PR's base can be retargeted to `custom-u256`.
🤖 Generated with Claude Code