fix(zisk): export CUDA_ARCH as CUDA_ARCHS env for cargo build #336
Open
qu0b wants to merge 1 commit into eth-act:master
The `ARG CUDA_ARCH=sm_120` declared in docker/zisk/Dockerfile.server and Dockerfile.cluster is never actually passed to the cargo build: an ARG is a build-time variable and does not become a RUN env var without explicit export. As a result, the zisk CUDA kernel build silently falls back to the committed sm_120 default in pil2-stark/src/goldilocks/CudaArch.mk regardless of what --cuda-archs is passed to build-image.sh.

The symptom is that the published ere-server-zisk:*-cuda images only contain sm_120 kernels and fail on any other GPU with CUDA error 209 (no kernel image is available for execution on the device), even when CUDA_ARCH was supposedly overridden at build time.

This plumbs the value through: strip the `sm_` prefix and set CUDA_ARCHS (plural, numeric) for the RUN, which is the env var read by proofman-starks-lib-c/build.rs to generate nvcc -gencode flags.

Verified by building with `--cuda-archs 89` and confirming all 15 embedded .cubin ELF sections in the resulting ere-server binary report sm_89 (via `cuobjdump --list-elf`). Running on an RTX 4090 now produces proofs instead of failing with CUDA 209.
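The prefix-stripping step described above can be sketched in plain POSIX shell. This is a minimal illustration under stated assumptions, not the Dockerfile change from this PR: the helper name `to_cuda_archs` is hypothetical, and the `sm_120` fallback simply mirrors the ARG default mentioned in the description.

```shell
#!/bin/sh
# Hypothetical helper (not part of the PR): turn an arch label such as
# "sm_89" into the bare number "89". Values that already lack the prefix
# pass through unchanged.
to_cuda_archs() {
    # ${1#sm_} is POSIX parameter expansion: drop a leading "sm_" if present.
    printf '%s\n' "${1#sm_}"
}

# Fall back to the ARG default named in the PR when CUDA_ARCH is unset.
CUDA_ARCH="${CUDA_ARCH:-sm_120}"

# Export the numeric form so that child processes (for example a build
# script reading CUDA_ARCHS) can see it.
CUDA_ARCHS="$(to_cuda_archs "$CUDA_ARCH")"
export CUDA_ARCHS

echo "CUDA_ARCHS=$CUDA_ARCHS"
```

Because a Dockerfile `RUN` line is executed by a shell, the same `${CUDA_ARCH#sm_}` expansion can in principle be written inline as a variable assignment in front of the build command; the exact line used in the PR is not reproduced here.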
Collaborator
Thanks for the fix! I was not aware that ZisK now supports multiple CUDA archs codegen by setting I opened another PR in #337 to build and publish the image with
Collaborator
Could you try again with the image that includes the fix from #337? It should now support the RTX 4090 without a local build. https://github.com/eth-act/ere/pkgs/container/ere%2Fere-server-zisk/811652646?tag=8401f02-cuda
Summary
`ARG CUDA_ARCH=sm_120` in `docker/zisk/Dockerfile.server` (and `Dockerfile.cluster`) is dead plumbing: the ARG is declared but never exported into the `RUN cargo build …` environment, so it has no effect.

As a result, the `--cuda-archs N` flag accepted by `.github/scripts/build-image.sh` is silently ignored for zisk. The build always produces an image whose embedded CUDA kernels target `sm_120` (the committed default in `pil2-stark/src/goldilocks/CudaArch.mk`), regardless of what was requested.

Symptom

The published `ghcr.io/eth-act/ere/ere-server-zisk:*-cuda` images fail on any GPU that is not compute capability 12.0 (RTX 5090 / consumer Blackwell) with CUDA error 209 (no kernel image is available for execution on the device).

This was hit locally on 2× RTX 4090 (`sm_89`) even when `--cuda-archs 89` was explicitly passed to `build-image.sh`.

Fix

Strip the `sm_` prefix off `CUDA_ARCH` and export it as `CUDA_ARCHS` (plural, numeric) inline on the cargo build command. That env var is what `proofman-starks-lib-c/build.rs` reads to generate the correct `nvcc -gencode` flags.

Same one-line change in `Dockerfile.cluster`.

Root cause detail

The committed `pil2-stark/src/goldilocks/CudaArch.mk` hardcodes `CUDA_ARCH = sm_120`. The auto-detect path in `configure.sh` requires `deviceQuery`, which needs a runtime GPU, and none is available during a Docker build. So without `CUDA_ARCHS` being set, the Makefile falls back to the committed default and silently produces an sm_120 image.

Verification

Built locally with `build-image.sh --zkvm zisk --tag local-sm89-cuda --base --server --cuda-archs 89` and confirmed via `cuobjdump --list-elf` that all 15 embedded `.cubin` ELF sections in `/ere/bin/ere-server` now report `sm_89`.

Running the resulting image on 2× RTX 4090 via the `ethpandaops/ethereum-package` zkboost stack (EIP-8025 testnet) produced real proofs.

Test plan

`build-image.sh … --cuda-archs 89` produces an image whose kernels are sm_89 (verified with cuobjdump)
`--cuda-archs 120` (unchanged default behavior)
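
The `cuobjdump --list-elf` check used for verification could be scripted roughly as follows. This is a sketch, not something shipped in the PR: the helper name `check_archs` is hypothetical, and it assumes only that each listed ELF entry carries an `sm_NN` tag somewhere on its line. The sample input uses made-up file names in that assumed format. It relies on `grep -o`, which is available in GNU and BSD grep.

```shell
#!/bin/sh
# Hypothetical helper: read `cuobjdump --list-elf` output on stdin and
# succeed only if at least one sm_NN tag is present and every tag found
# equals the expected architecture given as $1 (for example "sm_89").
check_archs() {
    expected="$1"
    # Extract every sm_<digits> token (with an optional letter suffix),
    # one per line, and de-duplicate. A single remaining line means that
    # all entries agree on one architecture.
    found="$(grep -oE 'sm_[0-9]+[a-z]?' | sort -u)"
    [ "$found" = "$expected" ]
}

# Sample input in the assumed format; the file names are invented.
sample='ELF file    1: kernels.1.sm_89.cubin
ELF file    2: kernels.2.sm_89.cubin'

if printf '%s\n' "$sample" | check_archs sm_89; then
    echo "all kernels target sm_89"
else
    echo "unexpected or missing architectures" >&2
fi
```

Against a real image, the sample would be replaced by piping `cuobjdump --list-elf /ere/bin/ere-server` into `check_archs sm_89`. An image that still contains only the default kernels would fail this check because the single tag found would be `sm_120`.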