setup: skip sm_103 cubin on CUDA 13.x (miscompile produces wrong fwd/update output on B300)#106

Open: vai-minzhou wants to merge 1 commit into Dao-AILab:main from vai-minzhou:fix-sm103-codegen


Conversation

@vai-minzhou

What

Skip the explicit compute_103,sm_103 gencode in setup.py when CUDA 13+ is detected. Source builds picked up the broken sm_103 cubin (nvcc 13.x miscompile of fwd/update) on B300 and silently returned wrong output; the published wheels were built with CUDA <13 and shipped only the sm_100 cubin, which is forward-compatible to sm_103 within the Blackwell family and is correct.

Fixes #105.
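
As a rough illustration of the gencode selection described above, the sketch below shows one way the flag list could be assembled. It is not the actual `setup.py` code: the function name `blackwell_gencode_flags`, the `skip_sm103` parameter, the tuple-based version comparison, and the 12.8 threshold for sm_100 are all assumptions made for this example. Only the behavior on CUDA 13+ (drop sm_103, keep sm_110 and sm_121) comes from this PR.

```python
def blackwell_gencode_flags(cuda_version, skip_sm103=True):
    """Return nvcc -gencode flags for Blackwell-era targets.

    Hypothetical helper for illustration; cuda_version is a (major, minor) tuple.
    """
    flags = []
    if cuda_version >= (12, 8):
        # Assumed threshold. The sm_100 cubin is the one the PR reports as
        # correct on B300 via intra-family forward compatibility.
        flags += ["-gencode", "arch=compute_100,code=sm_100"]
    if cuda_version >= (13, 0):
        if not skip_sm103:
            # Explicit sm_103 cubin: reported as miscompiled by nvcc 13.x.
            flags += ["-gencode", "arch=compute_103,code=sm_103"]
        # Unrelated targets that the PR keeps.
        flags += ["-gencode", "arch=compute_110,code=sm_110"]
        flags += ["-gencode", "arch=compute_121,code=sm_121"]
    return flags


if __name__ == "__main__":
    # With the default, no sm_103 flag is emitted on a CUDA 13.2 toolchain.
    print(blackwell_gencode_flags((13, 2)))
```

Keeping the skip behind a single switch in this sketch mirrors the PR's point that restoring sm_103 later should be a small, reversible change.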

Repro

`pip install causal-conv1d==1.6.1` (published wheel) works on B300 with max_abs_diff = 0. `pip install causal-conv1d --no-binary causal-conv1d` (source build) with CUDA 13.2 is broken on B300 with max_abs_diff ≈ 0.6. The full per-arch matrix is in #105.

Test plan

On B300 SXM6 AC, CUDA 13.2 V13.2.51, driver 580.x, BF16:

fwd_bf16_width4_no_init                                         PASS  max_abs_diff = 0
fwd_bf16_silu_no_init                                           PASS  max_abs_diff = 0
update_bf16_width4                                              PASS  state bit-exact
fwd_bf16_initial_states_silently_ignored_in_channel_first       PASS  drift_vs_no_init = 0

On upstream main, on the same hardware, all 4 tests fail with max_abs_err = 0.4–0.6.
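
For readers unfamiliar with the metric, the sketch below shows what a `max_abs_diff` check against a reference causal convolution might look like. It is a pure-Python, single-channel stand-in rather than the project's test code: the helper names `causal_conv1d_ref` and `max_abs_diff` are hypothetical, and the left-padded cross-correlation convention is an assumption chosen to match what a causal `F.conv1d` reference would compute.

```python
def causal_conv1d_ref(x, weight, bias=0.0):
    """Reference causal depthwise conv for one channel (hypothetical helper).

    x: list of floats of length L; weight: list of `width` floats.
    Output position t only sees x[t - width + 1 .. t].
    """
    width = len(weight)
    # Left-pad with zeros so no output position reads future samples.
    padded = [0.0] * (width - 1) + list(x)
    return [
        bias + sum(weight[k] * padded[t + k] for k in range(width))
        for t in range(len(x))
    ]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two sequences."""
    return max(abs(p - q) for p, q in zip(a, b))


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    ref = causal_conv1d_ref(x, [0.0, 0.0, 0.0, 1.0])
    # A correct kernel output would give 0.0 here; the broken build
    # reported differences of roughly 0.4 to 0.6.
    print(max_abs_diff(ref, x))
```

A result of exactly 0 corresponds to the bit-exact PASS lines above.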

Risk

Low — the change is gencode-only. sm_100 is documented as forward-compatible to sm_103 within Blackwell, and the published wheels for 1.6.1 already rely on this for B300 today. Restoring sm_103 after the nvcc bug is fixed is a one-line revert.

Why not fix the kernel instead

The kernel source compiles cleanly for sm_100 and sm_103 from the same TU; the sm_100 cubin is correct, the sm_103 cubin is not. That points squarely at nvcc's SM103 codegen path (or the CUB version shipped with CUDA 13.2 picking a broken WARP_TRANSPOSE layout for SM103). Working around the codegen bug here is a one-line, reversible change; root-causing the toolchain bug shouldn't block correctness on shipping hardware.

nvcc 13.2 produces a numerically-broken cubin for `causal_conv1d_fwd` and
`causal_conv1d_update` when building for `compute_103,sm_103`. On B300
(SM103, "Blackwell-Ultra") the loader prefers the exact-match sm_103
cubin from the fatbin, so ever since Dao-AILab#71 added the SM103 gencode any
source build with CUDA 13+ has been silently producing wrong output:

  fwd_bf16_width4_no_init                  max_abs_err = 6.05e-01
  fwd_bf16_silu_no_init                    max_abs_err = 4.09e-01
  update_bf16_width4         out_err=4.47e-01  state_err=9.79e-01

The wheels published to PyPI were built with CUDA <13 so they only
shipped the sm_100 cubin and fell through to it on B300 without anyone
noticing. The bug only surfaces on source builds (`pip install
causal-conv1d --no-binary causal-conv1d` with a CUDA 13.x toolchain).

The sm_100 cubin is forward-compatible to sm_103 within the Blackwell
family and produces bit-exact correct output on B300. Drop the explicit
`compute_103,sm_103` gencode from the `bare_metal_version >= 13.0`
branch and rely on Blackwell's intra-family forward-compat. Keep the
sm_110 + sm_121 gencodes — those are unrelated. Restoring sm_103 should
wait until the upstream nvcc bug is fixed (or the failure is bisected
to a CUB layout choice).

Verified on a B300 SXM6 AC, CUDA 13.2 V13.2.51, driver 580.x:
  - Before: full test suite fails with the errors above.
  - After:  bit-exact (`max_abs_diff = 0`) against the F.conv1d
    reference; full test_causal_conv1d.py passes.

See Dao-AILab#105 for the full repro and a per-arch correctness matrix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

B300 (SM103) numerical correctness: nvcc 13.x sm_103 cubin produces wrong output for fwd/update kernels