[Revert] nvidia-cutlass-dsl[cu13] 4.5.1 -> 4.5.0#25938

Merged
Kangyan-Zhou merged 2 commits into
sgl-project:mainfrom
Kangyan-Zhou:fix_cutlass_dsl_variant
May 21, 2026


@Kangyan-Zhou
Collaborator

@Kangyan-Zhou Kangyan-Zhou commented May 21, 2026

Problem

nvidia-cutlass-dsl[cu13] has additive extras on PyPI: both -libs-base AND -libs-cu13 are installed together when [cu13] is requested. They write to the same site-packages paths with different content, causing a GPUModuleOp TypeError at kernel-compile time (vllm-project/vllm#40082).
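To check for the overlap described above on a given machine, one option is to compare the file lists that each installed distribution records. The following is a minimal sketch, not part of this PR. The full distribution names `nvidia-cutlass-dsl-libs-base` and `nvidia-cutlass-dsl-libs-cu13` are assumed from the abbreviations used in this description, and the helper names are hypothetical.

```python
from importlib import metadata


def shared_paths(files_a, files_b):
    """Return the sorted paths that appear in both file lists."""
    return sorted({str(p) for p in files_a} & {str(p) for p in files_b})


def installed_files(dist_name):
    """Files recorded for an installed distribution, or [] if it is absent."""
    try:
        files = metadata.distribution(dist_name).files
    except metadata.PackageNotFoundError:
        return []
    return files or []  # .files is None when no file record is available


if __name__ == "__main__":
    # Assumed distribution names; adjust if the wheels are named differently.
    base = installed_files("nvidia-cutlass-dsl-libs-base")
    cu13 = installed_files("nvidia-cutlass-dsl-libs-cu13")
    for path in shared_paths(base, cu13):
        print(path)
```

Any path printed is claimed by both packages, so whichever wheel was installed last determines the content on disk.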

The correct libs package to keep depends on GPU family:

| Runner | Required libs | Why |
| --- | --- | --- |
| Blackwell (IS_BLACKWELL=1, CU13) | -libs-cu13 wins | Provides sm_110 arch alias missing in CUDA-12.9-built -libs-base |
| Non-Blackwell CU13 (H100, H200) | -libs-base wins | -libs-cu13 causes CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS in LoRA CUDA-graph capture (#25743) |
| Non-CU13 | No-op | Only -libs-base installed, no conflict |

History

Changes

Add fix_cutlass_dsl_libs() called from main() after download_flashinfer_cache:

  • IS_BLACKWELL=1, CU13: purge -libs-base, force-reinstall -libs-cu13 → sm_110 support
  • IS_BLACKWELL=0, CU13: purge -libs-cu13, force-reinstall -libs-base → no TypeError, no LoRA regression
  • Non-CU13: early return, no-op
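The three cases above can be sketched as a small decision function. This is an illustration in Python rather than the shell function added to ci_install_dependency.sh. The name `plan_cutlass_libs`, the full package names, and the assumption that the libs wheels share the version number of nvidia-cutlass-dsl are all hypothetical.

```python
def plan_cutlass_libs(is_blackwell, is_cu13, version):
    """Return the pip argument lists to run, in order, for one runner type."""
    base = "nvidia-cutlass-dsl-libs-base"  # assumed full package name
    cu13 = "nvidia-cutlass-dsl-libs-cu13"
    if not is_cu13:
        return []  # only -libs-base is installed, so there is nothing to fix
    # Blackwell keeps -libs-cu13; other CU13 runners keep -libs-base.
    drop, keep = (base, cu13) if is_blackwell else (cu13, base)
    return [
        ["pip", "uninstall", "-y", drop],
        # --no-deps avoids pulling the dropped package back in as a dependency.
        ["pip", "install", "--force-reinstall", "--no-deps", f"{keep}=={version}"],
    ]


if __name__ == "__main__":
    for command in plan_cutlass_libs(is_blackwell=False, is_cu13=True, version="4.5.0"):
        print(" ".join(command))
```

Returning the commands instead of running them keeps the selection logic easy to check; a caller could pass each list to `subprocess.run(command, check=True)`.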

Test Plan

  • H100 CI (base-b-test-*-gpu-large): eagle tests should pass (no more GPUModuleOp TypeError), LoRA tests should remain green
  • B200/GB300 CI (base-b-test-*-gpu-b200): sm_110 alias supported, cutlass import works

🤖 Generated with Claude Code


CI States

Latest PR Test (Base): ⏳ Run #26208799937
Latest PR Test (Extra): ❌ Run #26208799771

@Kangyan-Zhou
Collaborator Author

/tag-and-rerun-ci

@Kangyan-Zhou Kangyan-Zhou force-pushed the fix_cutlass_dsl_variant branch from dc34579 to dc02d67 Compare May 21, 2026 04:47
@Kangyan-Zhou Kangyan-Zhou changed the title from "[CI] Force-reinstall nvidia-cutlass-dsl with correct CUDA variant in ci_install_dependency.sh" to "[CI] Fix nvidia-cutlass-dsl libs conflict per GPU family in ci_install_dependency.sh" May 21, 2026
@mmangkad
Contributor

Should force-reinstall nvidia-cutlass-dsl-libs-cu13 last; otherwise it may still have issues.

PR sgl-project#25576 bumped nvidia-cutlass-dsl[cu13] from 4.5.0 to 4.5.1. The bump
exposed a latent file-level conflict between -libs-base and -libs-cu13
(both written by the additive [cu13] extra) as a hard GPUModuleOp
TypeError on H100: -libs-cu13's pybind11 binding changed to the new
MLIR-style ((operation: object)) without a matching bump to the Python
wrapper in nvidia-cutlass-dsl, so loading -libs-cu13's .so makes the
wrapper's old-style super().__init__() call fail.
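
The failure mode can be reproduced in plain Python. The sketch below is an analogy only, not the cutlass code: the class names, the `sym_name` keyword, and the exact old constructor signature are hypothetical stand-ins for a wrapper written against one constructor signature and loaded on top of a binding that exposes another.

```python
class OldBinding:
    """Stand-in for a binding whose constructor takes a symbol name."""

    def __init__(self, sym_name):
        self.sym_name = sym_name


class NewBinding:
    """Stand-in for a binding whose constructor takes an existing operation."""

    def __init__(self, operation):
        self.operation = operation


def make_wrapper(binding_cls):
    """Build a wrapper class written against the old constructor signature."""

    class GPUModuleOpWrapper(binding_cls):
        def __init__(self, name):
            super().__init__(sym_name=name)  # old-style call

    return GPUModuleOpWrapper


make_wrapper(OldBinding)("kernels")  # matching signatures: works
try:
    make_wrapper(NewBinding)("kernels")  # mismatched signatures
except TypeError as exc:
    print(f"TypeError: {exc}")
```

The wrapper code is identical in both cases; only the base class underneath it changes, which is why the error depends on which .so ends up on disk.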

Two changes:

1. Revert the version bump (4.5.1 -> 4.5.0). At 4.5.0 both .so files
   expose a compatible binding, so the same coexistence no longer crashes.
   This removes the active TypeError on H100 and on the CUDA-13 Docker
   image for non-Blackwell users.

2. Add fix_cutlass_dsl_libs() to ci_install_dependency.sh, called from
   main() after download_flashinfer_cache. The function picks the right
   libs package per GPU family even at 4.5.0 to avoid two independent
   regressions that the silent conflict could still hit:

     Blackwell (IS_BLACKWELL=1, CU13):
       Purge -libs-base, force-reinstall -libs-cu13 so its files take
       precedence. -libs-base is CUDA-12.9-built and lacks the sm_110
       arch alias that GB300/B200 need at cutlass import time.

     Non-Blackwell CU13 (H100, H200):
       Purge -libs-cu13, force-reinstall -libs-base. -libs-cu13 carries
       a CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS regression in LoRA CUDA-
       graph capture on sm_90 (sgl-project#25743 / reverted by sgl-project#25756).

     Non-CU13: no-op (only -libs-base ever installed).
@Kangyan-Zhou Kangyan-Zhou force-pushed the fix_cutlass_dsl_variant branch from e45a5d3 to b04b737 Compare May 21, 2026 06:06
@github-actions github-actions Bot added the dependencies Pull requests that update a dependency file label May 21, 2026
Revert the version bump from PR sgl-project#25576. At 4.5.1, -libs-cu13's pybind11
binding changed to new MLIR-style ((operation: object)) without a
matching bump to the Python wrapper in nvidia-cutlass-dsl, exposing the
latent file-level conflict between -libs-base and -libs-cu13 (both
written by the additive [cu13] extra) as a hard GPUModuleOp TypeError
at kernel-compile time on CU13 runners.

At 4.5.0 both .so files expose a compatible binding, so the same
coexistence is silent and CI was empirically green on H100 and Blackwell
during the post-sgl-project#25756, pre-sgl-project#25576 window. Going back to 4.5.0 restores
that state.

Supersedes sgl-project#25935 (which proposed the same revert but was closed).
@Kangyan-Zhou Kangyan-Zhou force-pushed the fix_cutlass_dsl_variant branch from 0e366cc to da60371 Compare May 21, 2026 06:10
@Kangyan-Zhou Kangyan-Zhou changed the title from "[CI] Fix nvidia-cutlass-dsl libs conflict per GPU family in ci_install_dependency.sh" to "[Revert] nvidia-cutlass-dsl[cu13] 4.5.1 -> 4.5.0" May 21, 2026
@Kangyan-Zhou
Collaborator Author

> Should force-reinstall nvidia-cutlass-dsl-libs-cu13 last; otherwise it may still have issues.

Let's revert this change first and then try your suggestion. This error is very confusing.

@Kangyan-Zhou Kangyan-Zhou merged commit 4ea8282 into sgl-project:main May 21, 2026
94 of 123 checks passed
Kangyan-Zhou added a commit to Kangyan-Zhou/sglang that referenced this pull request May 21, 2026
- Use "${REPO_ROOT}/python/pyproject.toml" instead of relative path so the
  version probe doesn't depend on the working directory the script is
  launched from (per gemini-code-assist review).
- Bump nvidia-cutlass-dsl[cu13] 4.5.0 -> 4.5.1 now that the wheel-mix
  TypeError is mitigated by force_reinstall_cutlass_dsl_libs_cu13. This
  re-applies sgl-project#25576 which was rolled back in sgl-project#25938 only because of the
  install-order bug.
alisonshao pushed a commit that referenced this pull request May 21, 2026
- Use "${REPO_ROOT}/python/pyproject.toml" instead of relative path so the
  version probe doesn't depend on the working directory the script is
  launched from (per gemini-code-assist review).
- Bump nvidia-cutlass-dsl[cu13] 4.5.0 -> 4.5.1 now that the wheel-mix
  TypeError is mitigated by force_reinstall_cutlass_dsl_libs_cu13. This
  re-applies #25576 which was rolled back in #25938 only because of the
  install-order bug.
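
A version probe that does not depend on the working directory can be sketched as follows. This is a Python illustration of the idea, not the shell code from the follow-up commit. The function names are hypothetical, and the sketch assumes the requirement is pinned with `==` somewhere in python/pyproject.toml.

```python
import re
from pathlib import Path


def pinned_version(pyproject_text, requirement):
    """Return the version pinned with '==' for `requirement`, or None."""
    pattern = re.escape(requirement) + r"\s*==\s*([0-9][0-9A-Za-z.+!-]*)"
    match = re.search(pattern, pyproject_text)
    return match.group(1) if match else None


def probe(repo_root, requirement="nvidia-cutlass-dsl[cu13]"):
    """Read <repo_root>/python/pyproject.toml regardless of the current directory."""
    path = Path(repo_root) / "python" / "pyproject.toml"
    return pinned_version(path.read_text(encoding="utf-8"), requirement)
```

A caller would derive `repo_root` from the script's own location (for example from `Path(__file__).resolve()`) rather than from the current directory, which mirrors the `${REPO_ROOT}` change described above.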

Labels

bypass-fastfail, dependencies (Pull requests that update a dependency file), high priority, run-ci
