[CI] Force-reinstall nvidia-cutlass-dsl-libs-cu13 last to avoid wheel-mix TypeError by Kangyan-Zhou · Pull Request #25958 · sgl-project/sglang

Kangyan-Zhou · 2026-05-21T07:13:52Z

Root cause

nvidia-cutlass-dsl[cu13] has additive PyPI extras — installing it pulls in both nvidia-cutlass-dsl-libs-base AND nvidia-cutlass-dsl-libs-cu13. The two wheels ship intentionally-different content for the same paths:

| Path | `-libs-base` | `-libs-cu13` |
| --- | --- | --- |
| `cutlass/_mlir/dialects/_gpu_ops_gen.py` | calls `super().__init__(self.build_generic(...))` (new-style single object) | calls `super().__init__(OPERATION_NAME, REGIONS, ...)` (old-style positional) |
| `cutlass/_mlir/_mlir_libs/_cutlass_ir.cpython-310-x86_64-linux-gnu.so` | pybind11 binding only accepts `(operation: object)` | pybind11 binding only accepts positional args |

Each wheel's .py is paired with a .so that has the matching API. If install order leaves the .py from one wheel and the .so from the other (which can happen via uv's install ordering), you get the hard TypeError seen in CI:

File ".../cutlass/_mlir/dialects/_gpu_ops_gen.py", line 1357, in __init__
    super().__init__(self.OPERATION_NAME, self._ODS_REGIONS, ...)
TypeError: __init__(): incompatible function arguments. The following argument types are supported:
    1. __init__(self, operation: object) -> None

This surfaces at kernel-compile time on CU13 CI runners during eagle / lora tests that go through flashinfer.rmsnorm_cute → cute.compile.
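To make the failure mode concrete, here is a minimal pure-Python stand-in for the mismatch. It is only a sketch: `BaseStyleBinding` and `Cu13StyleWrapper` are hypothetical names that imitate the two call styles, not the real cutlass classes, and the TypeError text Python produces here differs from the pybind11 message in the CI log.

```python
class BaseStyleBinding:
    """Stand-in for a native base class that accepts one prebuilt operation."""

    def __init__(self, operation: object) -> None:
        self.operation = operation


class Cu13StyleWrapper(BaseStyleBinding):
    """Stand-in for a generated wrapper that uses the positional call style."""

    OPERATION_NAME = "gpu.module"  # illustrative value only
    _ODS_REGIONS = (1, True)       # illustrative value only

    def __init__(self, sym_name: str) -> None:
        # Old-style call: several positional arguments instead of one object.
        # The base class above only accepts a single `operation` argument.
        super().__init__(self.OPERATION_NAME, self._ODS_REGIONS, sym_name)


def try_construct() -> str:
    """Build the wrapper and report whether the signature mismatch surfaced."""
    try:
        Cu13StyleWrapper("kernels")
    except TypeError as exc:
        return f"TypeError: {exc}"
    return "ok"


if __name__ == "__main__":
    print(try_construct())
```

The wrapper and the base class each work with a matching partner; the error only appears when the two halves disagree about the constructor signature, which is the situation a mixed install creates.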

Empirical evidence

Tested all 4 combinations on an H200 devbox by manually cp-ing wheel contents into site-packages:

| `.py` from | `.so` from | Smoke test (`gpu.GPUModuleOp(StringAttr, loc=loc)`) |
| --- | --- | --- |
| `-libs-base` | `-libs-base` | ✅ PASS |
| `-libs-cu13` | `-libs-cu13` | ✅ PASS |
| `-libs-cu13` | `-libs-base` | ❌ FAIL: exact CI TypeError, byte-for-byte |
| `-libs-base` | `-libs-cu13` | ✅ PASS |

Three of four states work. Only the mismatched .py=cu13 + .so=base breaks.

Fix

After install_sglang completes (with possibly mismatched state), force-reinstall -libs-cu13 last to guarantee both .py and .so come from the same wheel (BOTH-cu13 state):

$PIP_CMD install --force-reinstall --no-deps \
  "nvidia-cutlass-dsl-libs-cu13==${CUTLASS_DSL_VERSION}" \
  $PIP_INSTALL_SUFFIX

The version is parsed from pyproject.toml so it stays in sync with the pinned nvidia-cutlass-dsl version. The step is skipped on non-CU13 runners, where only -libs-base is installed and no conflict is possible.
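For readers who drive installs from Python rather than the CI shell script, the same step might look like the sketch below. Assumptions: it invokes pip through `sys.executable` instead of the script's `$PIP_CMD`, and `build_force_reinstall_cmd` and its `is_cu13` flag are hypothetical names; how a runner is detected as CU13 is left to the caller.

```python
import sys
from typing import List, Optional


def build_force_reinstall_cmd(version: str, is_cu13: bool) -> Optional[List[str]]:
    """Return the pip argv that reinstalls -libs-cu13 last, or None to skip."""
    # Non-CU13 runners only have -libs-base, so there is nothing to repair.
    # An empty version means the pin could not be parsed; skip in that case too.
    if not is_cu13 or not version:
        return None
    return [
        sys.executable, "-m", "pip", "install",
        "--force-reinstall",  # overwrite files even if the version is unchanged
        "--no-deps",          # do not touch any other installed package
        f"nvidia-cutlass-dsl-libs-cu13=={version}",
    ]


if __name__ == "__main__":
    print(build_force_reinstall_cmd("4.5.1", is_cu13=True))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` once the main install has finished, so that the files written last all come from the cu13 wheel.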

Validation on devbox

TypeError fix: forced BAD state on H200 devbox with UV_LINK_MODE=copy (matches CI), ran force_reinstall_cutlass_dsl_libs_cu13 — smoke test went FAIL → PASS, .so md5 changed from base's to cu13's.
LoRA regression check: ran test/registered/lora/test_lora_qwen3_8b_logprob_diff.py against the fix on the same devbox — both subtests passed, KL divergence 2.8e-4 (threshold 5e-3). The fix does NOT re-trigger the CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS regression from Revert #25690 to unblock LoRA Qwen3-8B CUDA graph capture on main #25743.
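The validation above compared md5 sums by hand. As an alternative sketch (a swapped-in technique, not what was run on the devbox), the installed wheel metadata can answer the same question: each distribution's RECORD lists a sha256 digest per file, so the file on disk can be matched against the wheel that wrote it. `record_digest` and `owners_of` are hypothetical helper names.

```python
import base64
import hashlib
from importlib import metadata
from pathlib import Path
from typing import List


def record_digest(path: Path) -> str:
    """sha256 of a file, encoded the way wheel RECORD files store it."""
    raw = hashlib.sha256(path.read_bytes()).digest()
    # RECORD uses URL-safe base64 with the trailing '=' padding removed.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def owners_of(path: Path, dist_names: List[str]) -> List[str]:
    """Names of the distributions whose RECORD entry matches the file on disk."""
    digest = record_digest(path)
    resolved = path.resolve()
    owners = []
    for name in dist_names:
        try:
            dist = metadata.distribution(name)
        except metadata.PackageNotFoundError:
            continue  # distribution is not installed in this environment
        for entry in dist.files or []:
            if entry.hash is None or entry.hash.mode != "sha256":
                continue
            same_file = Path(str(entry.locate())).resolve() == resolved
            if same_file and entry.hash.value == digest:
                owners.append(name)
    return owners
```

Calling `owners_of` on the installed `_gpu_ops_gen.py` and on the `_cutlass_ir` `.so` with `["nvidia-cutlass-dsl-libs-base", "nvidia-cutlass-dsl-libs-cu13"]` should name the same wheel for both files in a healthy environment; different answers would indicate the mixed state.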

Related PRs / supersedes

[Revert] nvidia-cutlass-dsl[cu13] 4.5.1 -> 4.5.0 #25938 (revert-only attempt) — superseded; the version bump wasn't the root cause
[Fix] Try to fix error caused by latest cutedsl packages #25690 / [Fix] Fix extra uninstall of cutlass packages #25756 / Revert #25690 to unblock LoRA Qwen3-8B CUDA graph capture on main #25743 / [Deps] Use cu13 extra for nvidia cutlass dsl #25576 — context for the wheel-mix history

🤖 Generated with Claude Code

CI States

Latest PR Test (Base): ❌ Run #26216901406
Latest PR Test (Extra): ❌ Run #26216901321

…-mix TypeError

nvidia-cutlass-dsl[cu13] has additive PyPI extras: both -libs-base and -libs-cu13 are installed and they ship intentionally-different content for the same site-packages paths:

    cutlass/_mlir/dialects/_gpu_ops_gen.py
    cutlass/_mlir/_mlir_libs/_cutlass_ir.cpython-*.so

Each wrapper .py is paired with a matching pybind11 .so. The two pairs use different MLIR Op constructor styles:

    -libs-base: super().__init__(self.build_generic(...)) (new-style)
    -libs-cu13: super().__init__(OPERATION_NAME, REGIONS, ...) (old-style)

If install order leaves the .py from one wheel and the .so from the other (reproducible by mixing the wheel contents), the wrapper's super().__init__ call signature does not match what the loaded .so accepts and the runtime raises:

    TypeError: __init__(): incompatible function arguments.
        1. __init__(self, operation: object) -> None

surfacing at kernel-compile time on H100 CU13 CI runners during eagle / lora tests that go through flashinfer.rmsnorm_cute -> cute.compile.

Tested all 4 (.py, .so) combinations on an H200 devbox: only the mismatched '.py=cu13 + .so=base' fails, producing the exact CI TypeError byte-for-byte. Three combinations pass.

Fix: after install_sglang completes (with possibly mismatched state), force-reinstall -libs-cu13 last so both .py and .so come from the same wheel (BOTH-cu13 state). The version is parsed from pyproject.toml so this stays in sync with whatever nvidia-cutlass-dsl version the project pins. Skips for non-CU13 runners (no [cu13] extra, no conflict).

Verified on an H200 devbox:

1. TypeError fix: forced bad state, ran force_reinstall_cutlass_dsl_libs_cu13 -> smoke test went FAIL -> PASS, .so md5 changed from base's to cu13's.
2. LoRA regression check: ran test_lora_qwen3_8b_logprob_diff.py -> both subtests passed, KL divergence 2.8e-4 (threshold 5e-3). The fix does NOT re-trigger the CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS regression from sgl-project#25743.

gemini-code-assist

Code Review

This pull request updates the nvidia-cutlass-dsl[cu13] dependency to version 4.5.1 and adds a force_reinstall_cutlass_dsl_libs_cu13 function to the CI installation script to prevent library mismatches. Feedback was provided to use the ${REPO_ROOT} variable for the pyproject.toml file path in the script to ensure it is correctly located regardless of the current working directory.

gemini-code-assist · 2026-05-21T07:16:23Z

+        return
+    fi
+
+    CUTLASS_DSL_VERSION=$(grep -Po -m1 'nvidia-cutlass-dsl(\[[^]]+\])?==\K[0-9A-Za-z\.\-]+' python/pyproject.toml || echo "")


Using a relative path for python/pyproject.toml makes the script's behavior dependent on the current working directory. Since REPO_ROOT is already defined and used elsewhere in this script for robustness, it should be used here as well. Additionally, quoting the path is a good practice to handle potential spaces in the directory name.

Suggested change

CUTLASS_DSL_VERSION=$(grep -Po -m1 'nvidia-cutlass-dsl(\[[^]]+\])?==\K[0-9A-Za-z\.\-]+' python/pyproject.toml || echo "")

CUTLASS_DSL_VERSION=$(grep -Po -m1 'nvidia-cutlass-dsl(\[[^]]+\])?==\K[0-9A-Za-z\.\-]+' "${REPO_ROOT}/python/pyproject.toml" || echo "")
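For comparison, the same version probe can be sketched in Python with the file located from an explicit repository root, so the result does not depend on the working directory. This is an illustration only, not the code in the PR: `parse_pinned_version` and `read_pinned_version` are hypothetical names, and the regex is a hand translation of the grep pattern above.

```python
import re
from pathlib import Path
from typing import Optional

# Optional extras such as [cu13], then an exact "==" pin; group 1 is the version.
PIN_RE = re.compile(r"nvidia-cutlass-dsl(?:\[[^\]]+\])?==([0-9A-Za-z.\-]+)")


def parse_pinned_version(pyproject_text: str) -> Optional[str]:
    """Return the first pinned nvidia-cutlass-dsl version, or None if absent."""
    match = PIN_RE.search(pyproject_text)
    return match.group(1) if match else None


def read_pinned_version(repo_root: Path) -> Optional[str]:
    """Read python/pyproject.toml under repo_root, independent of the cwd."""
    pyproject = repo_root / "python" / "pyproject.toml"
    if not pyproject.is_file():
        return None  # plays the role of the `|| echo ""` fallback in the script
    return parse_pinned_version(pyproject.read_text(encoding="utf-8"))
```

Because the pattern requires `==` directly after the package name or its extras, a line that pins `nvidia-cutlass-dsl-libs-cu13` on its own is not picked up by mistake.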

Kangyan-Zhou · 2026-05-21T08:52:43Z

@mmangkad I think your suggestion is correct, thanks for sharing it!

Kangyan-Zhou · 2026-05-21T08:52:55Z

/tag-and-rerun-ci

- Use "${REPO_ROOT}/python/pyproject.toml" instead of relative path so the version probe doesn't depend on the working directory the script is launched from (per gemini-code-assist review).
- Bump nvidia-cutlass-dsl[cu13] 4.5.0 -> 4.5.1 now that the wheel-mix TypeError is mitigated by force_reinstall_cutlass_dsl_libs_cu13. This re-applies sgl-project#25576 which was rolled back in sgl-project#25938 only because of the install-order bug.

mmangkad · 2026-05-21T08:56:08Z

> @mmangkad I think your suggestion is correct, thanks for sharing it!

Yeah that was the issue because the order of install matters, not the version. ~~Could we include the upgrade back to 4.5.1 here?~~ I just saw it

Kangyan-Zhou requested review from Fridge003, ispobock and merrymercy as code owners May 21, 2026 07:13

github-actions Bot added the dependencies label (Pull requests that update a dependency file) May 21, 2026

Kangyan-Zhou force-pushed the fix_cutlass_libs_install_order branch from de055b3 to 1a0dbf2 May 21, 2026 07:14

gemini-code-assist Bot reviewed May 21, 2026

github-actions Bot added the run-ci label May 21, 2026

Kangyan-Zhou added the bypass-fastfail label May 21, 2026

Merge branch 'main' into fix_cutlass_libs_install_order

13f8cf2

Kangyan-Zhou merged commit caa9f08 into sgl-project:main May 21, 2026
253 of 332 checks passed

nvpohanh mentioned this pull request May 22, 2026

[NVIDIA] [GDN] Enable FlashInfer MTP verify on SM100+ (Blackwell) #23273

Open

Kangyan-Zhou merged 3 commits into sgl-project:main from Kangyan-Zhou:fix_cutlass_libs_install_order

3 participants
