This workspace implements the square-matrix t = 2 multilinear polynomial
commitment from Brakedown, instantiated with coset Reed-Solomon encoding as in
the Shockwave variant.
The implementation uses:
- ICICLE v4.0.0
BN254 ScalarFieldas the only field representation. - ICICLE coset INTT/NTT for Reed-Solomon interpolation and evaluation.
- ICICLE Keccak-256 for Merkle leaves.
- ICICLE Blake2s for Merkle internal nodes.
- CPU Keccak-256 for the small Fiat-Shamir transcript.
- One complete encoded matrix column per Merkle leaf.
- Independent testing and evaluation query sets.
- One Merkle opening per unique column in the union of both query sets.
- CPU execution or a CUDA path for NTT, Merkle hashing, large Prove matrix products, and Verify RS/consistency checks.
See ICICLE_INTEGRATION.md for the concrete ICICLE Rust APIs, memory layout, backend setup, and official source locations.
crates/pcs/src/
field.rs ICICLE BN254 field helpers and canonical challenge reduction
polynomial.rs multilinear evaluations, eq vectors, and q_row^T U q_col
reed_solomon.rs CPU/CUDA batched coset INTT/NTT Reed-Solomon encoding
params.rs validated Shockwave dimensions and query count
encoding.rs column-major encoded matrix and retained CUDA buffers
serialization.rs canonical fixed-width field serialization
commitment.rs batched ICICLE Keccak/Blake2s Merkle commitment
transcript.rs CPU Keccak-256 Fiat-Shamir transcript
shockwave.rs Setup, Commit, Open, ProveEvaluation, VerifyEvaluation
crates/cli/src/main.rs
demo and small/medium/large/xlarge benchmark commands
For ell multilinear variables:
n = 2^ell
m = 2^(ell/2)
U is the m x m reshape of the polynomial evaluation vector
encoded_cols = m * rho_inv
The current implementation requires even ell, a square t = 2 matrix, and a
power-of-two rho_inv.
Entry point:
ShockwavePcs::setup(num_vars, rho_inv, num_queries, backend)Call flow:
ShockwavePcs::setup (crates/pcs/src/shockwave.rs:127)
-> ShockwaveParams::new (crates/pcs/src/params.rs:35)
validates ell, rho_inv, and num_queries
derives n, rows, cols, and encoded_cols
-> initialize_icicle (crates/pcs/src/commitment.rs:385)
loads CUDA once when requested
selects the active CPU or CUDA device
-> initialize_ntt_domain (crates/pcs/src/reed_solomon.rs:40)
creates or enlarges the per-device NTT domain
reuses a larger initialized domain for smaller transforms
CUDA setup currently initializes matching domains for both CUDA and the CPU fallback. Fiat-Shamir and Merkle path verification use direct Rust CPU hashing, while CUDA Verify uses the CUDA NTT and matrix-operation paths.
Entry point:
pcs.commit(&poly)Call flow:
ShockwavePcs::commit (crates/pcs/src/shockwave.rs:155)
-> encode_matrix (crates/pcs/src/encoding.rs:128)
-> rs_encode_matrix (crates/pcs/src/reed_solomon.rs:68)
CPU -> encode_on_cpu (crates/pcs/src/reed_solomon.rs:124)
CUDA -> encode_on_cuda (crates/pcs/src/reed_solomon.rs:166)
-> build_column_commitment (crates/pcs/src/commitment.rs:420)
-> ColumnMerkleTree::root (crates/pcs/src/commitment.rs:143)
The Reed-Solomon transform is:
each row U[i]
-> coset INTT over H, offset 5
-> zero-pad coefficients from m to encoded_cols
-> coset NTT over L, offset 7
-> encoded row U_hat[i]
CPU Commit executes all rows as one ICICLE batch and then converts the encoded matrix to column-major host storage.
CUDA Commit keeps the large data path on the GPU:
upload row-major U once
-> batch CUDA coset INTT
-> CUDA transpose-based zero-padding layout
-> batch CUDA coset NTT
-> CUDA transpose to column-major U_hat
-> batched CUDA Keccak leaf hashing
-> batched CUDA Blake2s Merkle layers
-> 32-byte root
The column-major encoded matrix is copied to host once for future openings. CUDA mode also retains:
- the original row-major polynomial matrix for large Prove matrix products;
- the encoded column-major matrix for direct Merkle leaf hashing.
For polynomial evaluations u:
u in F^n
U in F^(m x m)
U_hat[i] = RS(U[i])
U_hat in F^(m x encoded_cols)
commitment = MerkleRoot(columns(U_hat))
Entry point:
pcs.open_polynomial(&commitment, &poly)Call flow:
ShockwavePcs::open_polynomial (crates/pcs/src/shockwave.rs:168)
-> validate_commitment
-> ShockwavePcs::commit
-> compare the recomputed root with the supplied commitment
This is the complete polynomial opening. It is separate from the succinct evaluation proof below.
Entry point:
pcs.prove_evaluation(&poly, &state, &point)Call flow:
ShockwavePcs::prove_evaluation (crates/pcs/src/shockwave.rs:179)
-> prove_evaluation_with_backend (crates/pcs/src/shockwave.rs:189)
-> q1_q2_from_r (crates/pcs/src/polynomial.rs:100)
derives q_row and q_col
-> ShockwavePcs::combine_rows (crates/pcs/src/shockwave.rs:335)
evaluation_response = q_row^T U
-> inner_product (crates/pcs/src/shockwave.rs:493)
y = <evaluation_response, q_col>
-> Transcript::challenge_fields (crates/pcs/src/transcript.rs:67)
derives testing vector alpha
-> ShockwavePcs::combine_rows
testing_response = alpha^T U
-> Transcript::challenge_indices (crates/pcs/src/transcript.rs:82)
derives independent testing and evaluation query sets
-> unique_query_indices (crates/pcs/src/shockwave.rs:509)
computes the sorted union of both query sets
-> ColumnMerkleTree::open_columns (crates/pcs/src/commitment.rs:157)
opens every unique queried column exactly once
Backend policy:
- CPU mode uses direct CPU field arithmetic.
- CUDA small/medium use CPU row combinations because GPU launch overhead is larger than the work.
- CUDA large/xlarge use ICICLE CUDA
matmulagainst the polynomial matrix retained on the GPU during Commit.
The owned evaluation opening contains:
claimed value y
testing response
evaluation response
one column value and Merkle path for each unique queried column
The two Fiat-Shamir query sets remain independent. If a column appears in both sets, the proof carries that column and authentication path once and both consistency phases reference the same authenticated opening.
Entry point:
pcs.verify_evaluation(&commitment, &point, claimed_value, &proof)Call flow:
ShockwavePcs::verify_evaluation (crates/pcs/src/shockwave.rs:239)
-> verify_evaluation_with_backend (crates/pcs/src/shockwave.rs:250)
-> validate point, commitment, and proof shape
-> reconstruct the Fiat-Shamir transcript
-> derive alpha and both expected query sets
-> recompute the sorted query union
-> reject unless proof openings exactly match the unique union
-> ColumnMerkleTree::verify_columns_on_cpu
(crates/pcs/src/commitment.rs:184)
hashes opened leaves with CPU Keccak-256
reconstructs paths with CPU Blake2s
-> rs_encode_rows (crates/pcs/src/reed_solomon.rs:108)
encodes testing_response and evaluation_response as one two-row batch
-> ShockwavePcs::verify_consistency_openings
(crates/pcs/src/shockwave.rs:363)
checks testing and evaluation as two independent phases
each phase reads its columns from the shared unique opening list
-> inner_product (crates/pcs/src/shockwave.rs:493)
checks y = <evaluation_response, q_col>
Backend policy:
- CPU mode uses CPU batch NTT and direct CPU consistency arithmetic.
- CUDA mode performs the two response encodings with CUDA batch INTT/NTT.
- CUDA mode checks each query phase with one CUDA
matmul; overlapping query columns reuse the same authenticated opening. - Merkle path reconstruction and Fiat-Shamir remain on CPU because their messages are small.
The verifier never receives the full polynomial or full encoded matrix.
- Single ICICLE field representation: the complete project uses
icicle_bn254::ScalarField; there are no Arkworks/ICICLE conversions in the protocol path. - Batch all RS rows: Commit performs one batched INTT and one batched NTT instead of launching transforms row by row.
- Keep CUDA encoding on device: zero-padding and both matrix transposes run on CUDA between INTT, NTT, and Merkle hashing.
- Hash device fields directly: CUDA Merkle leaf hashing reinterprets the canonical ICICLE field buffer as bytes without rebuilding or reuploading a serialized leaf array.
- Retain the polynomial matrix on CUDA: large/xlarge Prove computes both row combinations with CUDA matrix multiplication.
- Use a workload threshold: small/medium Prove stays on CPU to avoid GPU launch overhead; large/xlarge uses CUDA.
- Batch Verify response encoding: testing and evaluation responses are encoded together as one two-row INTT/NTT batch.
- Batch CUDA consistency equations: every queried column in one proof phase is checked by one matrix multiplication.
- Deduplicate queried columns: the independent testing and evaluation query sets are merged only for proof storage. Each unique column and Merkle path is carried and verified once, then reused by both consistency phases.
- Retain Merkle hash layers: opening reads authentication paths without rehashing leaves or invoking one-proof APIs repeatedly.
- Verify Merkle paths on CPU: the queried batches are small, so CPU Keccak/Blake2s avoids unnecessary GPU transfer and synchronization.
- Keep Fiat-Shamir on CPU: small transcript hashes avoid CUDA launch cost while preserving identical bytes and challenges.
- Cache backend and NTT setup: CUDA libraries load once, devices switch without rescanning libraries, and initialized NTT domains are reused.
These optimizations preserve the field, cosets, Reed-Solomon code, Merkle hash definitions, transcript, independent query sets, verification equations, and security assumptions. Query deduplication changes only the proof's storage representation by omitting duplicate copies of an already authenticated column.
These are development benchmark profiles, not final production security recommendations.
All profiles use:
rho_inv = 2
RS rate rho = 1/2
encoded_cols = 2 * matrix_cols
two independent query sets
| Profile | ell |
n |
Matrix | Encoded matrix | Queries/phase | Commit iters | Prove iters | Verify iters |
|---|---|---|---|---|---|---|---|---|
| small | 8 | 256 | 16 x 16 | 16 x 32 | 8 | 20 | 20 | 100 |
| medium | 12 | 4,096 | 64 x 64 | 64 x 128 | 16 | 10 | 10 | 50 |
| large | 16 | 65,536 | 256 x 256 | 256 x 512 | 32 | 10 | 10 | 10 |
| xlarge | 20 | 1,048,576 | 1,024 x 1,024 | 1,024 x 2,048 | 64 | 10 | 10 | 10 |
benchmark all runs all four profiles.
Measured with release builds. These values are from the second confirming run after retaining query deduplication and restoring independent consistency checks plus direct Rust Merkle verification.
CPU:
CPU: Intel Core i9-14900, 32 logical CPUs
OS: Linux under WSL2
Backend: ICICLE CPU
| Profile | Commit | Evaluation proof | Verify | Commitment | Opening |
|---|---|---|---|---|---|
| small | 10.918 ms | 0.117 ms | 3.632 ms | 32 B | 10,576 B |
| medium | 40.414 ms | 0.808 ms | 4.539 ms | 32 B | 70,248 B |
| large | 171.328 ms | 7.875 ms | 10.668 ms | 32 B | 559,648 B |
| xlarge | 1,196.924 ms | 95.812 ms | 56.798 ms | 32 B | 4,239,696 B |
CUDA:
GPU: NVIDIA RTX 2000 Ada Generation, 16 GB
Driver: NVIDIA 581.42
CUDA Toolkit: 12.6
Backend: ICICLE v4.0.0 ubuntu22-cuda122
| Profile | Commit | Evaluation proof | Verify | Commitment | Opening |
|---|---|---|---|---|---|
| small | 0.881 ms | 0.115 ms | 0.667 ms | 32 B | 10,576 B |
| medium | 0.978 ms | 0.804 ms | 1.445 ms | 32 B | 70,248 B |
| large | 5.519 ms | 5.176 ms | 8.111 ms | 32 B | 559,648 B |
| xlarge | 60.239 ms | 30.687 ms | 49.068 ms | 32 B | 4,239,696 B |
For xlarge, the complete measured workflow with CUDA has about a 10.1x total speedup. Commit is about 20.7x faster and Prove about 3.3x faster.
Commitment size is the 32-byte root. Opening size includes the claimed value, both response vectors, unique opened columns, indices, and Merkle paths, but excludes external container overhead. The size reduction depends on how many columns overlap between the two deterministic Fiat-Shamir query sets; the large profile has no overlap for this benchmark input.
The CPU measurements use ICICLE's CPU NTT backend and should not be compared as an Arkworks FFT benchmark. The CUDA measurements show the intended acceleration path for the unified ICICLE field and NTT implementation.
Run all correctness tests:
cargo test --workspaceRun formatting and Clippy:
cargo fmt --all -- --check
cargo clippy --workspace --all-targets -- -D warningsRun an end-to-end demo:
cargo run --release -p cli -- demo small
cargo run --release -p cli -- demo xlargeRun all four CPU benchmarks:
cargo run --release -p cli -- benchmark allRun all four CUDA benchmarks:
ICICLE_BACKEND_INSTALL_DIR=/tmp/icicle-v4-cuda/icicle/lib/backend cargo run --release -p cli -- benchmark all --backend cuda --device-id 0Build the workspace:
cargo build --release --workspaceThe implementation includes the complete square-matrix t = 2 Shockwave PCS
workflow: ICICLE BN254 arithmetic, coset RS encoding, full polynomial opening,
non-interactive evaluation proofs, CPU/CUDA NTT, batched Merkle commitments,
CPU Fiat-Shamir, workload-aware CPU/CUDA Prove, CUDA Verify RS/consistency
checks, unique-query proof storage, owned Rust proofs, benchmarks,
compatibility tests, and negative tests.
The following remain future work:
- Rectangular matrix proof-size optimization.
- Merkle multiproofs that also deduplicate shared internal path nodes.
- Automatic query-count derivation from target soundness bits.
- A standardized wire encoding for the owned proof structs.
- Systematic Reed-Solomon encoding.