perf(point_evaluation): precompute tau*G2 prepared + fuse G1 scalar muls #661
Draft
0xVolosnikov wants to merge 2 commits into
Replace the per-call G2Prepared::from(G2_BY_TAU_POINT), which runs the 68-step BLS12-381 Miller-loop coefficient precomputation (63 G2 doublings + 5 G2 adds in Fq2) on every KZG verification, with a compile-time const PREPARED_G2_BY_TAU.

The const is generated for both field representations (6-limb ark_bls12_381 Fp for host, 8-limb ark_ff_delegation Fp for proving / RISC-V) and cfg-gated on the `proving` feature so the right variant compiles in each build.

A sanity test asserts the const matches G2_BY_TAU_POINT.into() at runtime in both variants. This catches stale literals if G2_BY_TAU_POINT or the Miller-loop precomputation shape ever changes upstream.

Bench (test_kzg_regression under ZKSYNC_RISC_V_RUN=true with the for-tests-benchmarking RISC-V binary; A/B against the rearranged-only version from the previous commit, same parent commit):

    point_evaluation raw cycles      31,791,155 → 29,615,041   −6.84%
    point_evaluation bigint deleg.    2,513,437 →  2,330,478   −7.28%
    effective (raw + 4·bigint)       41,844,903 → 38,936,953   −6.95%

Combined with the rearrangement, point_evaluation is now down 24.16% effective from the original verifier. Binary grows by 6,080 bytes (the inlined ell-coeffs literal: 68 ell_coeff triples × 6 Fq elements × 8 u64 limbs in the proving variant), still 9,824 bytes smaller than the pre-rearrangement baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
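The 63 + 5 = 68 step count and the "const plus runtime sanity test" pattern described in this commit message can be illustrated with a small standalone sketch. It models only the shape of the Miller-loop schedule, derived from the bits of the BLS12-381 parameter |x| = 0xd201000000010000, and not the actual Fq2 line coefficients. All names here (`BLS_X_ABS`, `Step`, `miller_schedule`, `SCHEDULE`) are hypothetical and do not come from the PR or from arkworks.

```rust
// Absolute value of the BLS12-381 curve parameter x (x itself is negative).
const BLS_X_ABS: u64 = 0xd201_0000_0001_0000;

// One entry per line-coefficient computation in the precomputation.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Step {
    Double,
    Add,
}

// 63 doublings + 5 additions, as stated in the commit message.
const NUM_STEPS: usize = 68;

// Walks the bits of x below the leading one, most significant first.
// Every bit costs one doubling; every set bit costs one extra addition.
const fn miller_schedule(x: u64) -> [Step; NUM_STEPS] {
    let mut out = [Step::Double; NUM_STEPS];
    let mut n = 0;
    let mut i = 63 - x.leading_zeros(); // index of the leading one bit
    while i > 0 {
        i -= 1;
        out[n] = Step::Double;
        n += 1;
        if (x >> i) & 1 == 1 {
            out[n] = Step::Add;
            n += 1;
        }
    }
    // Fails the build if the schedule does not have exactly 68 steps.
    assert!(n == NUM_STEPS);
    out
}

// Evaluated at compile time, analogous in spirit to a precomputed const.
const SCHEDULE: [Step; NUM_STEPS] = miller_schedule(BLS_X_ABS);
```

Comparing `SCHEDULE` against a fresh `miller_schedule(BLS_X_ABS)` call at runtime mirrors the role of the sanity test: if the generating logic changes, a hard-coded literal would no longer match.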
The rearranged verifier did two independent 255-bit G1 scalar multiplications back-to-back: y * G1_gen and z * proof. Each sw_double_and_add_projective does ~255 doublings + ~127 adds, so the two-scalar form pays ~510 doublings. Fuse them into a single interleaved double-and-add over the two bases.

Same pattern as the small-N path in bls12_381::msm. We avoid arkworks' VariableBaseMSM::msm_bigint because it allocates internally and the proving binary's allocator setup makes it trap with an illegal instruction in the simulator. The local hand-rolled loop has no allocations and matches the reference verifier byte-for-byte (test_rearranged_kzg_verifier_matches_reference still passes).

To express y*G1 - z*proof as a 2-base MSM with positive scalars we need -z mod r, computed once via Fr::from_bigint(z).neg().into_bigint() (z is canonical by upstream parse_scalar / Fr::into_bigint).

Bench (test_kzg_regression under ZKSYNC_RISC_V_RUN=true with the for-tests-benchmarking RISC-V binary; A/B vs the previous commit which precomputed PREPARED_G2_BY_TAU):

    point_evaluation raw cycles      29,615,041 -> 29,391,408   -0.76%
    point_evaluation bigint deleg.    2,330,478 ->  2,308,496   -0.94%
    effective (raw + 4*bigint)       38,936,953 -> 38,625,392   -0.80%
    dist/for_tests/app.bin size       1,343,276 ->  1,340,044   -3,232 B

Cumulative against the original (pre-rearrangement) verifier this is now 24.77% off effective. The smaller gain than the previous two commits reflects that the residual cost is now dominated by the Miller loop + final exponentiation, which are inside the pairing primitive itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
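The fused loop can be sketched on a toy group to show why it halves the doublings. The sketch below is not the PR's code: it replaces G1 with the integers modulo a 64-bit prime under addition (so that `k * base` can be checked with ordinary arithmetic), uses 64-bit scalars instead of 255-bit ones, and every name in it (`P`, `add`, `neg_mod_p`, `scalar_mul`, `fused_mul2`) is hypothetical.

```rust
// Toy stand-in for a curve group: integers modulo a prime, under addition.
// P is the "Goldilocks" prime 2^64 - 2^32 + 1; it also serves as the group order.
const P: u64 = 0xffff_ffff_0000_0001;

// Group addition, widened to u128 so the sum cannot overflow.
fn add(a: u64, b: u64) -> u64 {
    ((a as u128 + b as u128) % P as u128) as u64
}

// Additive inverse modulo P; works for both elements and scalars here.
fn neg_mod_p(x: u64) -> u64 {
    if x == 0 { 0 } else { P - x }
}

// Plain MSB-first double-and-add: computes k * base with 64 doublings.
fn scalar_mul(k: u64, base: u64) -> u64 {
    let mut acc: u64 = 0;
    for i in (0..64).rev() {
        acc = add(acc, acc); // doubling
        if (k >> i) & 1 == 1 {
            acc = add(acc, base);
        }
    }
    acc
}

// Fused form: computes k1 * b1 + k2 * b2 with one shared doubling per bit
// and one conditional addition per scalar, with no allocation.
fn fused_mul2(k1: u64, b1: u64, k2: u64, b2: u64) -> u64 {
    let mut acc: u64 = 0;
    for i in (0..64).rev() {
        acc = add(acc, acc); // single doubling shared by both scalars
        if (k1 >> i) & 1 == 1 {
            acc = add(acc, b1);
        }
        if (k2 >> i) & 1 == 1 {
            acc = add(acc, b2);
        }
    }
    acc
}
```

To compute `y * g - z * q` in this toy setting, negate the scalar once and call `fused_mul2(y, g, neg_mod_p(z), q)`. That is the same idea as negating `z` modulo r before the loop, so that both scalars are non-negative.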
What ❔
Two follow-up optimizations on top of the rearranged KZG verifier (PR #660). Each commit is one optimization, benched A/B at the same parent commit using `test_kzg_regression` under `ZKSYNC_RISC_V_RUN=true` with the `for-tests-benchmarking` RISC-V binary.

Commit 1: precompute `PREPARED_G2_BY_TAU` as a const

τ·G2 is the fixed KZG trusted-setup point. Every call to `verify_kzg_proof` previously ran the 68-step BLS12-381 Miller-loop coefficient precomputation (63 G2 doublings + 5 G2 adds in Fq²) on it. That work now happens at compile time, as a `const PREPARED_G2_BY_TAU`. Same pattern as the upstream `PREPARED_G2_GENERATOR` const in airbender-crypto, applied to τ·G2.

Two cfg-gated variants:

- Host (`#[cfg(not(feature = "proving"))]`): 6-limb `ark_bls12_381::Fp`
- Proving / RISC-V (`#[cfg(feature = "proving")]`): 8-limb `ark_ff_delegation::Fp` (delegation-aligned)

A `prepared_g2_by_tau_const_matches_runtime` test asserts byte-equality vs `G2_BY_TAU_POINT.into()` in each variant. This catches stale literals if upstream τ·G2 or the Miller-loop precomputation shape ever changes.

Commit 2: fuse the two G1 scalar mults into a 2-base interleaved double-and-add
The rearranged verifier still did `y * G1_gen` and `z * proof` as two independent `sw_double_and_add_projective` calls, ~510 doublings total. They are now fused into a single interleaved loop over the 256 bits of the two scalars (~255 doublings + ~256 conditional adds).

This avoids arkworks' `VariableBaseMSM::msm_bigint`, which traps with an illegal instruction in the proving simulator because of internal allocation. Same pattern as the existing small-N path in `basic_system::system_functions::bls12_381::msm`: the local hand-rolled loop is allocation-free.

Negation of `z` (to fold the subtraction into an MSM-style positive-scalar form) is done once via `Fr::from_bigint(z).neg().into_bigint()`. `z` is canonical by upstream `parse_scalar` / `Fr::into_bigint`, so the `from_bigint` `Option` is statically safe to unwrap.

Why ❔
Cumulative effect on `point_evaluation` cycles (figures as reported in the two commit messages; rearranged-only baseline → after commit 1 → after commit 2):

- `point_evaluation` raw cycles: 31,791,155 → 29,615,041 → 29,391,408
- `point_evaluation` bigint delegations: 2,513,437 → 2,330,478 → 2,308,496
- `dist/for_tests/app.bin`: +6,080 B after commit 1 (1,343,276 B), −3,232 B after commit 2 (1,340,044 B)

vs the original pre-rearrangement verifier: −24.77% effective.

The bigger of the two wins is the τ·G2 const: about 6 KB of binary for ~7% cycle savings, a cost that fits within the remaining binary headroom (the binary is still smaller than the original baseline). The fused mul is a smaller win (~0.8%) but is also a net binary-size reduction, because it replaces two inlined `mul_bigint` instantiations with one short loop.

The diminishing returns reflect that the residual cost is now dominated by the Miller loop + final exponentiation, primitives that live inside airbender-crypto and aren't reachable from this codepath.
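The "effective" figures quoted in the commit messages follow the stated formula raw + 4·bigint. A minimal sketch for reproducing them and the percentage deltas is below; the helper names `effective` and `pct_change` are hypothetical, and the weight of 4 per bigint delegation is taken from the commit messages rather than derived here.

```rust
// Effective cycles as used in the bench summaries: raw + 4 * bigint delegations.
fn effective(raw: u64, bigint: u64) -> u64 {
    raw + 4 * bigint
}

// Relative change from `before` to `after`, in percent (negative means fewer cycles).
fn pct_change(before: u64, after: u64) -> f64 {
    (after as f64 - before as f64) / before as f64 * 100.0
}
```

For example, `effective(31_791_155, 2_513_437)` gives the 41,844,903 baseline quoted for the rearranged-only verifier, and `pct_change` between that and the post-commit-1 value rounds to the reported −6.95%.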
Is this a breaking change?
Checklist
Stacked on

Based on `vv/kzg-rearranged-verifier` (PR #660). Lands cleanly on top once #660 merges.

Follow-ups not in this PR

- … `PREPARED_G2_GENERATOR`, so other consumers benefit and the const lives next to the source of truth.
- `VariableBaseMSM::msm_bigint` traps in the proving simulator; if fixable, the small-N hand-rolled loop could be replaced by the upstream Pippenger path with its 3-bit windows for an additional ~2-3% in this codepath. (Probably an allocator quirk, since the precompiles crate hits the same issue.)

🤖 Generated with Claude Code