[CI] Force-reinstall nvidia-cutlass-dsl-libs-cu13 last to avoid wheel-mix TypeError #25958

Merged: Kangyan-Zhou merged 3 commits into sgl-project:main from Kangyan-Zhou:fix_cutlass_libs_install_order on May 21, 2026

@Kangyan-Zhou Kangyan-Zhou commented May 21, 2026

Root cause

nvidia-cutlass-dsl[cu13] has additive PyPI extras — installing it pulls in both nvidia-cutlass-dsl-libs-base AND nvidia-cutlass-dsl-libs-cu13. The two wheels ship intentionally-different content for the same paths:

| Path | -libs-base | -libs-cu13 |
| --- | --- | --- |
| cutlass/_mlir/dialects/_gpu_ops_gen.py | calls super().__init__(self.build_generic(...)) (new-style, single object) | calls super().__init__(OPERATION_NAME, REGIONS, ...) (old-style, positional) |
| cutlass/_mlir/_mlir_libs/_cutlass_ir.cpython-310-x86_64-linux-gnu.so | pybind11 binding only accepts (operation: object) | pybind11 binding only accepts positional args |

Each wheel's .py is paired with a .so that has the matching API. If install order leaves the .py from one wheel and the .so from the other (which can happen via uv's install ordering), you get the hard TypeError seen in CI:

File ".../cutlass/_mlir/dialects/_gpu_ops_gen.py", line 1357, in __init__
    super().__init__(self.OPERATION_NAME, self._ODS_REGIONS, ...)
TypeError: __init__(): incompatible function arguments. The following argument types are supported:
    1. __init__(self, operation: object) -> None

This surfaces at kernel-compile time on CU13 CI runners during eagle / lora tests that go through flashinfer.rmsnorm_cute -> cute.compile.
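To see the overlap on a given runner, one option is to scan the installed `*.dist-info/RECORD` files for the shared path. The sketch below assumes the installer wrote standard RECORD files (rows of `path,hash,size`); `owners_of` is a hypothetical helper name, not something in the CI script.

```shell
# Print every installed distribution whose RECORD claims a given file.
# Two or more names for one path mean the copy on disk comes from whichever
# wheel was written last.
# Usage: owners_of <site-packages dir> <path relative to site-packages>
# Example (illustrative):
#   owners_of "$(python3 -c 'import sysconfig; print(sysconfig.get_paths()["purelib"])')" \
#       cutlass/_mlir/dialects/_gpu_ops_gen.py
owners_of() {
    local site_packages="$1" rel_path="$2" record
    for record in "${site_packages}"/*.dist-info/RECORD; do
        # Skip the unexpanded glob when no dist-info directory exists.
        [ -f "${record}" ] || continue
        # Match the first CSV column exactly against the requested path.
        if awk -F, -v p="${rel_path}" '$1 == p { found = 1 } END { exit !found }' "${record}"; then
            basename "$(dirname "${record}")" .dist-info
        fi
    done
}
```

On an affected CU13 runner this would be expected to list both the -libs-base and the -libs-cu13 distributions for the same path.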

Empirical evidence

Tested all 4 combinations on an H200 devbox by manually cp-ing wheel contents into site-packages:

| .py from | .so from | Smoke test (gpu.GPUModuleOp(StringAttr, loc=loc)) |
| --- | --- | --- |
| -libs-base | -libs-base | ✅ PASS |
| -libs-cu13 | -libs-cu13 | ✅ PASS |
| -libs-cu13 | -libs-base | ❌ FAIL (exact CI TypeError, byte-for-byte) |
| -libs-base | -libs-cu13 | ✅ PASS |

Three of four states work. Only the mismatched .py=cu13 + .so=base breaks.

Fix

After install_sglang completes (with possibly mismatched state), force-reinstall -libs-cu13 last to guarantee both .py and .so come from the same wheel (BOTH-cu13 state):

$PIP_CMD install --force-reinstall --no-deps \
  "nvidia-cutlass-dsl-libs-cu13==${CUTLASS_DSL_VERSION}" \
  $PIP_INSTALL_SUFFIX

The version is parsed from pyproject.toml so it stays in sync with the project's nvidia-cutlass-dsl pin. The step is skipped on non-CU13 runners, where only -libs-base is installed and no conflict is possible.
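Put together, the step could look roughly like the sketch below. The function name, `PIP_CMD`, `PIP_INSTALL_SUFFIX`, `REPO_ROOT`, and the grep pattern come from this PR; the `IS_CU13` flag and the handling of an empty version are assumptions, since the real script's CU13 check is not shown here.

```shell
# Sketch only: IS_CU13 is a hypothetical flag standing in for the script's
# actual CU13 detection.
force_reinstall_cutlass_dsl_libs_cu13() {
    # Non-CU13 runners install only -libs-base, so there is nothing to fix.
    if [ "${IS_CU13:-0}" != "1" ]; then
        echo "Not a CU13 runner; skipping nvidia-cutlass-dsl-libs-cu13 reinstall"
        return 0
    fi

    # Read the pinned version from pyproject.toml so it stays in sync.
    local CUTLASS_DSL_VERSION
    CUTLASS_DSL_VERSION=$(grep -Po -m1 'nvidia-cutlass-dsl(\[[^]]+\])?==\K[0-9A-Za-z\.\-]+' \
        "${REPO_ROOT}/python/pyproject.toml" || echo "")
    if [ -z "${CUTLASS_DSL_VERSION}" ]; then
        # Assumption: with no pin found, do nothing rather than guess a version.
        echo "No nvidia-cutlass-dsl pin found; skipping reinstall" >&2
        return 0
    fi

    # PIP_CMD and PIP_INSTALL_SUFFIX stay unquoted so they can expand to
    # several words (for example a multi-word installer command).
    ${PIP_CMD} install --force-reinstall --no-deps \
        "nvidia-cutlass-dsl-libs-cu13==${CUTLASS_DSL_VERSION}" \
        ${PIP_INSTALL_SUFFIX:-}
}
```

Running it as the last install step matters: any later install that rewrites the shared paths could reintroduce the mixed state.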

Validation on devbox

  1. TypeError fix: forced BAD state on H200 devbox with UV_LINK_MODE=copy (matches CI), ran force_reinstall_cutlass_dsl_libs_cu13 — smoke test went FAIL → PASS, .so md5 changed from base's to cu13's.
  2. LoRA regression check: ran test/registered/lora/test_lora_qwen3_8b_logprob_diff.py against the fix on the same devbox; both subtests passed, with KL divergence 2.8e-4 (threshold 5e-3). The fix does NOT re-trigger the CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS regression from #25743 ("Revert #25690 to unblock LoRA Qwen3-8B CUDA graph capture on main").
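The md5 comparison in step 1 can be scripted along these lines. This is a minimal sketch that assumes reference copies of the .so have been unpacked from each wheel beforehand; `md5_of` and `classify_so` are hypothetical helper names, and `md5sum` is the GNU coreutils tool.

```shell
# Print the MD5 hex digest of a file, or nothing if it cannot be read.
md5_of() {
    md5sum "$1" 2>/dev/null | cut -d' ' -f1
}

# Report which reference copy the installed file matches.
# Usage: classify_so <installed .so> <.so from -libs-base> <.so from -libs-cu13>
classify_so() {
    local installed
    installed=$(md5_of "$1")
    if [ -z "${installed}" ]; then
        echo "missing"
    elif [ "${installed}" = "$(md5_of "$2")" ]; then
        echo "base"
    elif [ "${installed}" = "$(md5_of "$3")" ]; then
        echo "cu13"
    else
        echo "unknown"
    fi
}
```

Calling `classify_so` before and after the force-reinstall should report `base` and then `cu13` when the bad state was present.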

Related PRs / supersedes

🤖 Generated with Claude Code


CI States

Latest PR Test (Base): ❌ Run #26216901406
Latest PR Test (Extra): ❌ Run #26216901321

@github-actions github-actions Bot added the dependencies label (Pull requests that update a dependency file) May 21, 2026
…-mix TypeError

nvidia-cutlass-dsl[cu13] has additive PyPI extras: both -libs-base and
-libs-cu13 are installed and they ship intentionally-different content
for the same site-packages paths:

  cutlass/_mlir/dialects/_gpu_ops_gen.py
  cutlass/_mlir/_mlir_libs/_cutlass_ir.cpython-*.so

Each wrapper .py is paired with a matching pybind11 .so. The two pairs
use different MLIR Op constructor styles:

  -libs-base: super().__init__(self.build_generic(...))  (new-style)
  -libs-cu13: super().__init__(OPERATION_NAME, REGIONS, ...) (old-style)

If install order leaves the .py from one wheel and the .so from the
other (reproducible by mixing the wheel contents), the wrapper's
super().__init__ call signature does not match what the loaded .so
accepts and the runtime raises:

  TypeError: __init__(): incompatible function arguments.
    1. __init__(self, operation: object) -> None

surfacing at kernel-compile time on H100 CU13 CI runners during eagle /
lora tests that go through flashinfer.rmsnorm_cute -> cute.compile.

Tested all 4 (.py, .so) combinations on an H200 devbox: only the
mismatched '.py=cu13 + .so=base' fails, producing the exact CI TypeError
byte-for-byte. Three combinations pass.

Fix: after install_sglang completes (with possibly mismatched state),
force-reinstall -libs-cu13 last so both .py and .so come from the same
wheel (BOTH-cu13 state). The version is parsed from pyproject.toml so
this stays in sync with whatever nvidia-cutlass-dsl version the project
pins. Skips for non-CU13 runners (no [cu13] extra, no conflict).

Verified on an H200 devbox:
  1. TypeError fix: forced bad state, ran force_reinstall_cutlass_dsl_libs_cu13
     -> smoke test went FAIL -> PASS, .so md5 changed from base's to cu13's.
  2. LoRA regression check: ran test_lora_qwen3_8b_logprob_diff.py
     -> both subtests passed, KL divergence 2.8e-4 (threshold 5e-3).
     The fix does NOT re-trigger the CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS
     regression from sgl-project#25743.
@Kangyan-Zhou Kangyan-Zhou force-pushed the fix_cutlass_libs_install_order branch from de055b3 to 1a0dbf2 on May 21, 2026 07:14

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the nvidia-cutlass-dsl[cu13] dependency to version 4.5.1 and adds a force_reinstall_cutlass_dsl_libs_cu13 function to the CI installation script to prevent library mismatches. Feedback was provided to use the ${REPO_ROOT} variable for the pyproject.toml file path in the script to ensure it is correctly located regardless of the current working directory.

return
fi

CUTLASS_DSL_VERSION=$(grep -Po -m1 'nvidia-cutlass-dsl(\[[^]]+\])?==\K[0-9A-Za-z\.\-]+' python/pyproject.toml || echo "")

medium

Using a relative path for python/pyproject.toml makes the script's behavior dependent on the current working directory. Since REPO_ROOT is already defined and used elsewhere in this script for robustness, it should be used here as well. Additionally, quoting the path is a good practice to handle potential spaces in the directory name.

Suggested change
- CUTLASS_DSL_VERSION=$(grep -Po -m1 'nvidia-cutlass-dsl(\[[^]]+\])?==\K[0-9A-Za-z\.\-]+' python/pyproject.toml || echo "")
+ CUTLASS_DSL_VERSION=$(grep -Po -m1 'nvidia-cutlass-dsl(\[[^]]+\])?==\K[0-9A-Za-z\.\-]+' "${REPO_ROOT}/python/pyproject.toml" || echo "")
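How `REPO_ROOT` is set is not shown in this thread. As one illustration of making the lookup independent of the working directory, the hypothetical helper below walks up from a starting directory until it finds `python/pyproject.toml`; the real script may derive `REPO_ROOT` differently.

```shell
# Hypothetical helper: print the nearest ancestor directory (or the starting
# directory itself) that contains python/pyproject.toml.
# Usage: find_repo_root [start dir]   (defaults to the current directory)
find_repo_root() {
    local dir
    dir=$(cd "${1:-.}" && pwd) || return 1
    while [ "${dir}" != "/" ]; do
        if [ -f "${dir}/python/pyproject.toml" ]; then
            printf '%s\n' "${dir}"
            return 0
        fi
        dir=$(dirname "${dir}")
    done
    # Reached the filesystem root without finding the marker file.
    return 1
}
```

With something like `REPO_ROOT=$(find_repo_root)` the version probe reads the same file whether the script is launched from the repository root or from a subdirectory.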

@Kangyan-Zhou
Collaborator Author

@mmangkad I think your suggestion is correct, thanks for sharing it!

@Kangyan-Zhou
Collaborator Author

/tag-and-rerun-ci

- Use "${REPO_ROOT}/python/pyproject.toml" instead of relative path so the
  version probe doesn't depend on the working directory the script is
  launched from (per gemini-code-assist review).
- Bump nvidia-cutlass-dsl[cu13] 4.5.0 -> 4.5.1 now that the wheel-mix
  TypeError is mitigated by force_reinstall_cutlass_dsl_libs_cu13. This
  re-applies sgl-project#25576 which was rolled back in sgl-project#25938 only because of the
  install-order bug.
@mmangkad
Contributor

mmangkad commented May 21, 2026

> @mmangkad I think your suggestion is correct, thanks for sharing it!

Yeah that was the issue because the order of install matters, not the version. Could we include the upgrade back to 4.5.1 here? I just saw it

@Kangyan-Zhou Kangyan-Zhou merged commit caa9f08 into sgl-project:main May 21, 2026
253 of 332 checks passed

Labels

bypass-fastfail, dependencies (Pull requests that update a dependency file), run-ci
