From 0c48312d8755e10c13c475cdbd37deb75dada7ba Mon Sep 17 00:00:00 2001
From: Stefan
Date: Tue, 21 Apr 2026 12:03:37 +0200
Subject: [PATCH] fix(zisk): export CUDA_ARCH as CUDA_ARCHS env for cargo build

The `ARG CUDA_ARCH=sm_120` declared in docker/zisk/Dockerfile.server and
Dockerfile.cluster is never actually consumed by the cargo build: Docker
exposes a build arg to RUN commands only under its own name, and nothing
in the build reads CUDA_ARCH. As a result, the zisk CUDA kernel build
silently falls back to the committed sm_120 default in
pil2-stark/src/goldilocks/CudaArch.mk regardless of what --cuda-archs is
passed to build-image.sh.

The symptom is that the published ere-server-zisk:*-cuda images only
contain sm_120 kernels and fail on any other GPU with CUDA error 209
(no kernel image is available for execution on the device), even when
CUDA_ARCH was supposedly overridden at build time.

This plumbs the value through: strip the `sm_` prefix and set CUDA_ARCHS
(plural, numeric) for the RUN, which is the env var read by
proofman-starks-lib-c/build.rs to generate nvcc -gencode flags.

Verified by building with `--cuda-archs 89` and confirming all 15
embedded .cubin ELF sections in the resulting ere-server binary report
sm_89 (via `cuobjdump --list-elf`). Running on an RTX 4090 now produces
proofs instead of failing with CUDA error 209.
---
 docker/zisk/Dockerfile.cluster | 7 +++++--
 docker/zisk/Dockerfile.server  | 9 ++++++++-
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/docker/zisk/Dockerfile.cluster b/docker/zisk/Dockerfile.cluster
index db57f0a2..9fb20f6b 100644
--- a/docker/zisk/Dockerfile.cluster
+++ b/docker/zisk/Dockerfile.cluster
@@ -57,8 +57,11 @@ ARG CUDA
 # Default to build for RTX 50 series
 ARG CUDA_ARCH=sm_120
 
-# Build binaries
-RUN cargo build --release ${CUDA:+--features gpu}
+# Strip the sm_ prefix and export as CUDA_ARCHS (plural, numeric) which is the
+# env var proofman-starks-lib-c/build.rs reads to generate nvcc -gencode flags.
+# See note in docker/zisk/Dockerfile.server for full rationale.
+RUN CUDA_ARCHS="${CUDA_ARCH#sm_}" \
+    cargo build --release ${CUDA:+--features gpu}
 
 FROM $RUNTIME_IMAGE AS runtime
 FROM $RUNTIME_CUDA_IMAGE AS runtime_cuda
diff --git a/docker/zisk/Dockerfile.server b/docker/zisk/Dockerfile.server
index aceff040..d314431f 100644
--- a/docker/zisk/Dockerfile.server
+++ b/docker/zisk/Dockerfile.server
@@ -17,7 +17,14 @@ ARG RUSTFLAGS
 # Default to build for RTX 50 series
 ARG CUDA_ARCH=sm_120
 
-RUN cargo build --release --package ere-server --bin ere-server --features zisk${CUDA:+,cuda} \
+# Strip the sm_ prefix and export as CUDA_ARCHS (plural, numeric) which is the
+# env var proofman-starks-lib-c/build.rs reads to generate nvcc -gencode flags.
+# Without this export, CUDA_ARCH is declared-but-unused and the build silently
+# falls back to the committed sm_120 default in pil2-stark's CudaArch.mk,
+# producing an image that fails on any non-sm_120 GPU with CUDA error 209
+# (no kernel image is available for execution on the device).
+RUN CUDA_ARCHS="${CUDA_ARCH#sm_}" \
+    cargo build --release --package ere-server --bin ere-server --features zisk${CUDA:+,cuda} \
     && mkdir bin && mv target/release/ere-server bin/ere-server \
     && cargo clean && rm -rf $CARGO_HOME/registry/
 