setup: skip sm_103 cubin on CUDA 13.x (miscompile produces wrong fwd/update output on B300) by vai-minzhou · Pull Request #106 · Dao-AILab/causal-conv1d

vai-minzhou · 2026-04-30T21:59:22Z

What

Skip the explicit compute_103,sm_103 gencode in setup.py when CUDA 13+ is detected. Source builds picked up the broken sm_103 cubin (nvcc 13.x miscompile of fwd/update) on B300 and silently returned wrong output; the published wheels were built with CUDA <13 and shipped only the sm_100 cubin, which is forward-compatible to sm_103 within the Blackwell family and is correct.
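For reviewers who want the shape of the change without opening the diff, here is a minimal sketch of the gencode selection after this PR. It is not the actual setup.py code: the helper name `blackwell_gencode_flags`, the tuple-based version argument, and the `include_sm_103` switch are all hypothetical, and the sketch assumes the toolchain is new enough to target sm_100 at all. Only the behaviour described above is modelled: sm_100 is always emitted, sm_110 and sm_121 are kept on CUDA 13+, and sm_103 is left out.

```python
def blackwell_gencode_flags(cuda_version, include_sm_103=False):
    """Return nvcc -gencode flags for Blackwell-class targets (illustrative only).

    cuda_version is a (major, minor) tuple such as (13, 2).
    include_sm_103 is a hypothetical switch showing how small the revert would be.
    """
    flags = []

    def add(arch):
        # One "-gencode arch=compute_XX,code=sm_XX" pair per target.
        flags.extend(["-gencode", f"arch=compute_{arch},code=sm_{arch}"])

    # Assumption: the toolchain can already target sm_100.
    add("100")

    if cuda_version >= (13, 0):
        if include_sm_103:
            # Left off by default: the sm_103 cubin is the one that miscompiles.
            add("103")
        # Unrelated to the bug, so these stay.
        add("110")
        add("121")

    return flags


if __name__ == "__main__":
    print(blackwell_gencode_flags((13, 2)))
```

With the default arguments, a CUDA 13.2 build emits sm_100, sm_110 and sm_121 only, so a B300 has no exact-match cubin to prefer and runs the sm_100 one.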

Fixes #105.

Repro

`pip install causal-conv1d==1.6.1` works on B300 (max_abs_diff = 0). `pip install causal-conv1d --no-binary causal-conv1d` with CUDA 13.2 is broken on B300 (max_abs_diff ≈ 0.6). The full per-arch matrix is in #105.

Test plan

On B300 SXM6 AC, CUDA 13.2 V13.2.51, driver 580.x, BF16:

fwd_bf16_width4_no_init                                         PASS  max_abs_diff = 0
fwd_bf16_silu_no_init                                           PASS  max_abs_diff = 0
update_bf16_width4                                              PASS  state bit-exact
fwd_bf16_initial_states_silently_ignored_in_channel_first       PASS  drift_vs_no_init = 0

By comparison, upstream main on the same hardware fails all 4 tests with max_abs_err = 0.4–0.6.
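For readers unfamiliar with the metric, the following is a rough pure-Python sketch of what `max_abs_diff` against a causal-convolution reference means. It is not the test script used for the numbers above: it handles a single channel of plain floats rather than BF16 tensors on a GPU, and the function names are illustrative. The reference left-pads the input with `width - 1` zeros so that output `t` depends only on inputs up to `t`.

```python
def causal_conv1d_ref(x, weight, bias=0.0):
    """Single-channel causal convolution on plain Python floats.

    out[t] = bias + sum_k weight[k] * x[t - (width - 1) + k],
    with x treated as zero before index 0.
    """
    width = len(weight)
    padded = [0.0] * (width - 1) + list(x)
    return [
        bias + sum(weight[k] * padded[t + k] for k in range(width))
        for t in range(len(x))
    ]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length sequences."""
    return max(abs(p - q) for p, q in zip(a, b))


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    w = [0.5, 0.25, 0.125, 1.0]  # width 4, as in the width4 test cases
    ref = causal_conv1d_ref(x, w)
    print(ref)                     # [1.0, 2.125, 3.5, 5.375]
    print(max_abs_diff(ref, ref))  # 0.0
```

A kernel that matches the reference reports a difference of 0. For inputs of roughly unit scale, errors of 0.4 to 0.6 indicate substantively wrong output rather than rounding noise.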

Risk

Low — the change is gencode-only. sm_100 is documented as forward-compatible to sm_103 within Blackwell, and the published wheels for 1.6.1 already rely on this for B300 today. Restoring sm_103 after the nvcc bug is fixed is a one-line revert.
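The fallback this relies on can be pictured with a deliberately simplified model of cubin selection. The sketch below is an assumption-laden illustration, not the CUDA loader's real algorithm: it treats a cubin as usable when its major compute capability matches the device and its minor version is not newer, and it picks the closest such cubin. PTX JIT fallback and architecture-specific targets are not modelled, and the function name `select_cubin` is hypothetical.

```python
def select_cubin(device_cc, cubins):
    """Pick the cubin a device would run under a simplified compatibility model.

    device_cc and each entry of cubins are (major, minor) tuples,
    e.g. (10, 3) for sm_103. Returns None when nothing is usable.
    """
    major, minor = device_cc
    # Usable: same major version, minor version not newer than the device.
    usable = [cc for cc in cubins if cc[0] == major and cc[1] <= minor]
    # Closest match wins, so an exact match beats an older minor version.
    return max(usable) if usable else None


if __name__ == "__main__":
    b300 = (10, 3)
    print(select_cubin(b300, [(10, 0), (10, 3)]))           # (10, 3)
    print(select_cubin(b300, [(10, 0), (11, 0), (12, 1)]))  # (10, 0)
```

Under this model, a fatbin containing both sm_100 and sm_103 hands the B300 the sm_103 cubin, while a fatbin without sm_103 falls through to sm_100, which is the path the 1.6.1 wheels already take.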

Why not fix the kernel instead

The kernel source compiles cleanly for sm_100 and sm_103 from the same TU; the sm_100 cubin is correct, the sm_103 cubin is not. That points squarely at nvcc's SM103 codegen path (or the CUB version shipped with CUDA 13.2 picking a broken WARP_TRANSPOSE layout for SM103). Working around the codegen bug here is a one-line, reversible change; root-causing the toolchain bug shouldn't block correctness on shipping hardware.

nvcc 13.2 produces a numerically-broken cubin for `causal_conv1d_fwd` and `causal_conv1d_update` when building for `compute_103,sm_103`. On B300 (SM103, "Blackwell-Ultra") the loader prefers the exact-match sm_103 cubin from the fatbin, so ever since Dao-AILab#71 added the SM103 gencode any source build with CUDA 13+ has been silently producing wrong output:

    fwd_bf16_width4_no_init  max_abs_err = 6.05e-01
    fwd_bf16_silu_no_init    max_abs_err = 4.09e-01
    update_bf16_width4       out_err=4.47e-01 state_err=9.79e-01

The wheels published to PyPI were built with CUDA <13 so they only shipped the sm_100 cubin and fell through to it on B300 without anyone noticing. The bug only surfaces on source builds (`pip install causal-conv1d --no-binary causal-conv1d` with a CUDA 13.x toolchain).

The sm_100 cubin is forward-compatible to sm_103 within the Blackwell family and produces bit-exact correct output on B300. Drop the explicit `compute_103,sm_103` gencode from the `bare_metal_version >= 13.0` branch and rely on Blackwell's intra-family forward-compat. Keep the sm_110 + sm_121 gencodes; those are unrelated. Restoring sm_103 should wait until the upstream nvcc bug is fixed (or the failure is bisected to a CUB layout choice).

Verified on a B300 SXM6 AC, CUDA 13.2 V13.2.51, driver 580.x:

- Before: full test suite fails with the errors above.
- After: bit-exact (`max_abs_diff = 0`) against the F.conv1d reference; full test_causal_conv1d.py passes.

See Dao-AILab#105 for the full repro and a per-arch correctness matrix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vai-minzhou wants to merge 1 commit into Dao-AILab:main from vai-minzhou:fix-sm103-codegen
