fix(zisk): export CUDA_ARCH as CUDA_ARCHS env for cargo build #336

Open

qu0b wants to merge 1 commit into eth-act:master from qu0b:fix/zisk-dockerfile-cuda-arch-dead-plumbing

Conversation


@qu0b qu0b commented Apr 21, 2026

Summary

ARG CUDA_ARCH=sm_120 in docker/zisk/Dockerfile.server (and Dockerfile.cluster) is dead plumbing — the ARG is declared but never exported into the RUN cargo build … environment, so it has no effect.

As a result, the --cuda-archs N flag accepted by .github/scripts/build-image.sh is silently ignored for zisk. The build always produces an image whose embedded CUDA kernels target sm_120 (the committed default in pil2-stark/src/goldilocks/CudaArch.mk), regardless of what was requested.
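For context, the intended plumbing is roughly the following (a sketch; the exact docker invocation inside build-image.sh may differ):

# Requested on the command line:
build-image.sh --zkvm zisk --cuda-archs 89
# ...which is (presumably) forwarded as a Docker build arg:
docker build --build-arg CUDA_ARCH=sm_89 -f docker/zisk/Dockerfile.server .
# Before this fix, the value stopped at the ARG declaration and never
# reached the cargo build, so the kernels were still compiled for sm_120.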

Symptom

The published ghcr.io/eth-act/ere/ere-server-zisk:*-cuda images fail on any GPU that is not compute capability 12.0 (RTX 5090 / consumer Blackwell) with:

[CUDA] cudaMemcpyToSymbol(GPU_C_4, …) failed due to:
  no kernel image is available for execution on the device (209)
  at src/goldilocks/src/poseidon2_goldilocks.cu:66

This was hit locally on 2× RTX 4090 (sm_89) even when --cuda-archs 89 was explicitly passed to build-image.sh.
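A quick way to check which compute capability a given GPU reports (supported by reasonably recent drivers; the output shown is illustrative for a 4090):

$ nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
NVIDIA GeForce RTX 4090, 8.9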

Fix

Strip the sm_ prefix off CUDA_ARCH and export it as CUDA_ARCHS (plural, numeric) inline on the cargo build command. That env var is what proofman-starks-lib-c/build.rs reads to generate the correct nvcc -gencode flags.

-RUN cargo build --release --package ere-server --bin ere-server --features zisk${CUDA:+,cuda} \
+RUN CUDA_ARCHS="${CUDA_ARCH#sm_}" \
+    cargo build --release --package ere-server --bin ere-server --features zisk${CUDA:+,cuda} \

Same one-line change in Dockerfile.cluster.
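The ${CUDA_ARCH#sm_} expansion is standard POSIX prefix stripping, so the existing sm_-prefixed build arg keeps working unchanged:

$ CUDA_ARCH=sm_89;  echo "${CUDA_ARCH#sm_}"
89
$ CUDA_ARCH=sm_120; echo "${CUDA_ARCH#sm_}"
120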

Root cause detail

The committed pil2-stark/src/goldilocks/CudaArch.mk hardcodes CUDA_ARCH = sm_120. The auto-detect path in configure.sh requires deviceQuery, which needs a runtime GPU — not available during a Docker build. So without CUDA_ARCHS being set, the Makefile falls back to the committed default and silently produces an sm_120 image.
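With CUDA_ARCHS set, the build can emit one -gencode pair per requested arch. The resulting nvcc flags would look roughly like this (illustrative, not the verbatim output of proofman-starks-lib-c/build.rs):

# CUDA_ARCHS=89
nvcc -gencode arch=compute_89,code=sm_89 -c src/goldilocks/src/poseidon2_goldilocks.cu
# CUDA_ARCHS=89,120 (fat binary with cubins for both arches)
nvcc -gencode arch=compute_89,code=sm_89 -gencode arch=compute_120,code=sm_120 ...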

Verification

Built locally with build-image.sh --zkvm zisk --tag local-sm89-cuda --base --server --cuda-archs 89 and confirmed via cuobjdump --list-elf that all 15 embedded .cubin ELF sections in /ere/bin/ere-server now report sm_89.
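For reference, the verification step looks like this (output shape illustrative):

$ cuobjdump --list-elf /ere/bin/ere-server
ELF file    1: ere-server.1.sm_89.cubin
ELF file    2: ere-server.2.sm_89.cubin
...
ELF file   15: ere-server.15.sm_89.cubin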

Running the resulting image on 2× RTX 4090 via the ethpandaops/ethereum-package zkboost stack (EIP-8025 testnet) produced real proofs:

zkboost_prove_total{proof_type="reth-zisk", status="success"} 3
zkboost_prove_duration_seconds_sum = 34.47   # ~11.5s per proof

Test plan

  • build-image.sh … --cuda-archs 89 produces an image whose kernels are sm_89 (verified with cuobjdump)
  • Resulting image runs on RTX 4090 without CUDA error 209
  • End-to-end zkboost + lighthouse + reth pipeline generates and verifies proofs
  • Upstream CI builds cleanly for --cuda-archs 120 (unchanged default behavior)

Commit message

The `ARG CUDA_ARCH=sm_120` declared in docker/zisk/Dockerfile.server
and Dockerfile.cluster is never actually passed to the cargo build: an
ARG is a build-time variable and does not become a RUN env var without
explicit export. As a result, the zisk CUDA kernel build silently falls
back to the committed sm_120 default in pil2-stark/src/goldilocks/CudaArch.mk
regardless of what --cuda-archs is passed to build-image.sh.

The symptom is that the published ere-server-zisk:*-cuda images only
contain sm_120 kernels and fail on any other GPU with CUDA error 209
(no kernel image is available for execution on the device), even when
CUDA_ARCH was supposedly overridden at build time.

This plumbs the value through: strip the `sm_` prefix and set CUDA_ARCHS
(plural, numeric) for the RUN, which is the env var read by
proofman-starks-lib-c/build.rs to generate nvcc -gencode flags.

Verified by building with `--cuda-archs 89` and confirming all 15
embedded .cubin ELF sections in the resulting ere-server binary report
sm_89 (via `cuobjdump --list-elf`). Running on an RTX 4090 now produces
proofs instead of failing with CUDA 209.
Collaborator

han0110 commented Apr 21, 2026

Thanks for the fix! I was not aware that ZisK now supports codegen for multiple CUDA archs via CUDA_ARCHS, and that the old CUDA_ARCH env is ignored, which is why you hit the error even when rebuilding with --cuda-archs 89.

I opened another PR in #337 to build and publish the image with CUDA_ARCHS=89,120, and after that we should be able to run the published image on 4090 directly.
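Assuming #337 uses the same inline-env pattern as this PR, the multi-arch build would look something like the line below (a sketch, not necessarily the exact change in #337); the result is a fat binary whose kernels run on both sm_89 and sm_120:

RUN CUDA_ARCHS="89,120" \
    cargo build --release --package ere-server --bin ere-server --features zisk${CUDA:+,cuda} \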

Collaborator

han0110 commented Apr 21, 2026

Could you try again with the image that includes the fix from #337? It should support the 4090 now without building locally. https://github.com/eth-act/ere/pkgs/container/ere%2Fere-server-zisk/811652646?tag=8401f02-cuda
