setup: skip sm_103 cubin on CUDA 13.x (miscompile produces wrong fwd/update output on B300) #106
Open
vai-minzhou wants to merge 1 commit into
nvcc 13.2 produces a numerically-broken cubin for `causal_conv1d_fwd` and `causal_conv1d_update` when building for `compute_103,sm_103`. On B300 (SM103, "Blackwell-Ultra") the loader prefers the exact-match sm_103 cubin from the fatbin, so ever since Dao-AILab#71 added the SM103 gencode any source build with CUDA 13+ has been silently producing wrong output:

    fwd_bf16_width4_no_init  max_abs_err = 6.05e-01
    fwd_bf16_silu_no_init    max_abs_err = 4.09e-01
    update_bf16_width4       out_err=4.47e-01 state_err=9.79e-01

The wheels published to PyPI were built with CUDA <13 so they only shipped the sm_100 cubin and fell through to it on B300 without anyone noticing. The bug only surfaces on source builds (`pip install causal-conv1d --no-binary causal-conv1d` with a CUDA 13.x toolchain).

The sm_100 cubin is forward-compatible to sm_103 within the Blackwell family and produces bit-exact correct output on B300. Drop the explicit `compute_103,sm_103` gencode from the `bare_metal_version >= 13.0` branch and rely on Blackwell's intra-family forward-compat. Keep the sm_110 + sm_121 gencodes; those are unrelated. Restoring sm_103 should wait until the upstream nvcc bug is fixed (or the failure is bisected to a CUB layout choice).

Verified on a B300 SXM6 AC, CUDA 13.2 V13.2.51, driver 580.x:

- Before: full test suite fails with the errors above.
- After: bit-exact (`max_abs_diff = 0`) against the F.conv1d reference; full test_causal_conv1d.py passes.

See Dao-AILab#105 for the full repro and a per-arch correctness matrix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
What
Skip the explicit `compute_103,sm_103` gencode in `setup.py` when CUDA 13+ is detected. Source builds picked up the broken sm_103 cubin (nvcc 13.x miscompile of `fwd`/`update`) on B300 and silently returned wrong output; the published wheels were built with CUDA <13 and shipped only the sm_100 cubin, which is forward-compatible to sm_103 within the Blackwell family and is correct.

Fixes #105.
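For readers who have not looked at the build script, the shape of the change can be sketched roughly as follows. This is a hypothetical, simplified stand-in and not the actual `setup.py`: the function name `blackwell_gencodes`, the `skip_sm103` switch, and the assumption that sm_100 is enabled from CUDA 12.8 onward are all illustrative.

```python
# Hypothetical sketch of gencode selection; NOT the real setup.py.
# Assumption: sm_100 is emitted for CUDA >= 12.8, and the 13.0 branch adds
# the newer targets. Only the 13.0 branch is described in this PR.

def blackwell_gencodes(cuda_version, skip_sm103=True):
    """Return nvcc -gencode flags for Blackwell-class targets.

    cuda_version is a (major, minor) tuple such as (13, 2).
    skip_sm103=True models this PR: no dedicated sm_103 cubin is built.
    """
    flags = []

    def add(arch):
        flags.extend(["-gencode", f"arch=compute_{arch},code=sm_{arch}"])

    if cuda_version >= (12, 8):
        add("100")  # B300 falls through to this cubin when sm_103 is absent
    if cuda_version >= (13, 0):
        if not skip_sm103:
            add("103")  # the cubin that nvcc 13.2 miscompiles, per this PR
        add("110")
        add("121")
    return flags


if __name__ == "__main__":
    print(blackwell_gencodes((13, 2)))
```

Flipping `skip_sm103` back to `False` in this sketch corresponds to the one-line revert once the toolchain is fixed.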
Repro
`pip install causal-conv1d==1.6.1` (works on B300, max_abs_diff=0) vs `pip install causal-conv1d --no-binary causal-conv1d` with CUDA 13.2 (broken on B300, max_abs_diff ≈ 0.6). Full per-arch matrix in #105.
Test plan
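The `max_abs_diff` figures in this PR compare kernel output against an F.conv1d reference. As a rough illustration of what such a comparison measures, here is a pure-Python sketch for a single channel with no bias or activation. It is not the test suite's code; the helper names `causal_conv1d_ref` and `max_abs_err` are made up for this example.

```python
# Illustrative only: single-channel causal convolution and an error metric.
# The real tests compare GPU kernel output against torch's F.conv1d.

def causal_conv1d_ref(x, weight):
    """out[t] = sum_k weight[k] * x[t - (width - 1) + k].

    Inputs before the start of the sequence are treated as zero, so each
    output position depends only on the current and earlier inputs.
    """
    width = len(weight)
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k in range(width):
            src = t - (width - 1) + k
            if src >= 0:
                acc += weight[k] * x[src]
        out.append(acc)
    return out


def max_abs_err(actual, expected):
    """Largest element-wise absolute difference between two sequences."""
    return max(abs(a - e) for a, e in zip(actual, expected))


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    ref = causal_conv1d_ref(x, [0.0, 0.0, 0.0, 1.0])  # identity tap, width 4
    print(max_abs_err(ref, x))
```

A correct build should report an error of zero (or within rounding) against the reference; the broken sm_103 cubin reports errors on the order of the signal itself.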
On B300 SXM6 AC, CUDA 13.2 V13.2.51, driver 580.x, BF16:
vs upstream `main` on the same hardware: all 4 fail with `max_abs_err = 0.4–0.6`.
Risk
Low: the change is gencode-only. sm_100 is documented as forward-compatible to sm_103 within Blackwell, and the published wheels for 1.6.1 already rely on this for B300 today. Restoring `sm_103` after the nvcc bug is fixed is a one-line revert.
Why not fix the kernel instead
The kernel source compiles cleanly for `sm_100` and `sm_103` from the same TU; the sm_100 cubin is correct, the sm_103 cubin is not. That points squarely at nvcc's SM103 codegen path (or the CUB version shipped with CUDA 13.2 picking a broken WARP_TRANSPOSE layout for SM103). Working around the codegen bug here is a one-line, reversible change; root-causing the toolchain bug shouldn't block correctness on shipping hardware.
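The fall-through behaviour this workaround relies on can be modelled roughly as below. This is a deliberately simplified sketch, not the CUDA loader's actual algorithm: it assumes a cubin is eligible when its major version matches the device and its minor version does not exceed the device's, and it ignores PTX JIT fallback and architecture-specific variants. The name `pick_cubin` is hypothetical.

```python
# Simplified model of cubin selection from a fatbin; an assumption-laden
# illustration, not the driver's real logic.

def pick_cubin(device_cc, cubins):
    """Return the (major, minor) cubin a device would run, or None.

    device_cc and each entry in cubins are (major, minor) tuples.
    The closest eligible cubin wins, so an exact match is preferred.
    """
    major, minor = device_cc
    eligible = [cc for cc in cubins if cc[0] == major and cc[1] <= minor]
    return max(eligible) if eligible else None


if __name__ == "__main__":
    b300 = (10, 3)
    # Source build before this PR: the exact-match sm_103 cubin is chosen.
    print(pick_cubin(b300, [(10, 0), (10, 3), (11, 0), (12, 1)]))
    # After this PR (and in the published wheels): sm_100 is chosen instead.
    print(pick_cubin(b300, [(10, 0), (11, 0), (12, 1)]))
```

Under this model, removing the sm_103 entry is enough to route B300 onto the sm_100 cubin without touching the kernel source.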