[CI] Force-reinstall nvidia-cutlass-dsl-libs-cu13 last to avoid wheel-mix TypeError #25958
nvidia-cutlass-dsl[cu13] has additive PyPI extras: both -libs-base and
-libs-cu13 are installed and they ship intentionally-different content
for the same site-packages paths:
    cutlass/_mlir/dialects/_gpu_ops_gen.py
    cutlass/_mlir/_mlir_libs/_cutlass_ir.cpython-*.so
Each wrapper .py is paired with a matching pybind11 .so. The two pairs
use different MLIR Op constructor styles:
    -libs-base: super().__init__(self.build_generic(...))        (new-style)
    -libs-cu13: super().__init__(OPERATION_NAME, REGIONS, ...)   (old-style)
If install order leaves the .py from one wheel and the .so from the
other (reproducible by mixing the wheel contents), the wrapper's
super().__init__ call signature does not match what the loaded .so
accepts and the runtime raises:
    TypeError: __init__(): incompatible function arguments.
        1. __init__(self, operation: object) -> None
surfacing at kernel-compile time on H100 CU13 CI runners during eagle /
lora tests that go through flashinfer.rmsnorm_cute -> cute.compile.
Tested all 4 (.py, .so) combinations on an H200 devbox: only the
mismatched '.py=cu13 + .so=base' fails, producing the exact CI TypeError
byte-for-byte. Three combinations pass.
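
One way to tell which of the four states an environment is in is to compare the installed files byte-for-byte against reference copies extracted from each wheel. The following is a minimal sketch, not part of this PR: the helper names classify_origin and report_pair_state, the state labels, and the demo file names are all hypothetical, and the demo uses small stand-in files rather than real wheel contents.

```shell
#!/bin/sh
# Hypothetical helpers (not from the PR) for classifying a (.py, .so) pair.

# classify_origin INSTALLED BASE_COPY CU13_COPY
# Prints which reference copy the installed file is byte-identical to.
classify_origin() {
    if cmp -s "$1" "$2"; then
        echo "base"
    elif cmp -s "$1" "$3"; then
        echo "cu13"
    else
        echo "unknown"
    fi
}

# report_pair_state PY_ORIGIN SO_ORIGIN
# Maps the two origins onto the four combinations described above.
report_pair_state() {
    case "$1+$2" in
        base+base|cu13+cu13) echo "consistent" ;;
        cu13+base)           echo "broken" ;;            # the failing combination
        base+cu13)           echo "mixed-but-passing" ;; # mismatched, but reported to pass
        *)                   echo "unknown" ;;
    esac
}

# Demo with stand-in files (assumption: real use would point at files
# extracted from the -libs-base and -libs-cu13 wheels).
DEMO_DIR="$(mktemp -d)"
printf 'wrapper from base\n' > "$DEMO_DIR/base_gpu_ops_gen.py"
printf 'wrapper from cu13\n' > "$DEMO_DIR/cu13_gpu_ops_gen.py"
printf 'binary from base\n'  > "$DEMO_DIR/base_cutlass_ir.so"
printf 'binary from cu13\n'  > "$DEMO_DIR/cu13_cutlass_ir.so"

# Simulate the failing state: .py from cu13, .so from base.
cp "$DEMO_DIR/cu13_gpu_ops_gen.py" "$DEMO_DIR/installed_gpu_ops_gen.py"
cp "$DEMO_DIR/base_cutlass_ir.so"  "$DEMO_DIR/installed_cutlass_ir.so"

PY_ORIGIN="$(classify_origin "$DEMO_DIR/installed_gpu_ops_gen.py" \
    "$DEMO_DIR/base_gpu_ops_gen.py" "$DEMO_DIR/cu13_gpu_ops_gen.py")"
SO_ORIGIN="$(classify_origin "$DEMO_DIR/installed_cutlass_ir.so" \
    "$DEMO_DIR/base_cutlass_ir.so" "$DEMO_DIR/cu13_cutlass_ir.so")"
STATE="$(report_pair_state "$PY_ORIGIN" "$SO_ORIGIN")"
echo ".py=$PY_ORIGIN .so=$SO_ORIGIN -> $STATE"
```

A byte comparison is used here instead of a checksum only to keep the sketch dependency-free; comparing md5 sums, as in the devbox verification, serves the same purpose.
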
Fix: after install_sglang completes (with possibly mismatched state),
force-reinstall -libs-cu13 last so both .py and .so come from the same
wheel (BOTH-cu13 state). The version is parsed from pyproject.toml so
this stays in sync with whatever nvidia-cutlass-dsl version the project
pins. Skips for non-CU13 runners (no [cu13] extra, no conflict).
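
The shape of the fix can be sketched roughly as follows. This is not the body of force_reinstall_cutlass_dsl_libs_cu13 from the PR, which is not shown here; the function name, its two parameters, the DRY_RUN switch, the use of pip rather than uv, the --no-deps flag, and the assumption that -libs-cu13 is pinned to the same version string as nvidia-cutlass-dsl are all illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical sketch of a "force-reinstall last" step (not the PR's code).

# force_reinstall_libs_cu13_sketch VERSION IS_CU13
#   VERSION  version string parsed from pyproject.toml (may be empty)
#   IS_CU13  "1" on CU13 runners, anything else otherwise (assumed flag)
force_reinstall_libs_cu13_sketch() {
    if [ "$2" != "1" ]; then
        # Non-CU13 runners have no [cu13] extra, so there is nothing to fix.
        echo "skip: not a CU13 runner"
    elif [ -z "$1" ]; then
        # Assumption: without a parsed version, do nothing rather than guess.
        echo "skip: no pinned nvidia-cutlass-dsl version found"
    elif [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of running it.
        echo "python3 -m pip install --force-reinstall --no-deps nvidia-cutlass-dsl-libs-cu13==$1"
    else
        # Reinstall only this wheel so its .py and .so overwrite whatever is there.
        python3 -m pip install --force-reinstall --no-deps \
            "nvidia-cutlass-dsl-libs-cu13==$1"
    fi
}

# Dry-run demo: show what would be executed on a CU13 runner and on another runner.
DRY_RUN=1
PLAN="$(force_reinstall_libs_cu13_sketch "4.5.1" 1)"
SKIPPED="$(force_reinstall_libs_cu13_sketch "4.5.1" 0)"
echo "$PLAN"
echo "$SKIPPED"
```

Running the reinstall as the final step is what matters: whichever wheel is written last determines both files, so the pair ends up in the BOTH-cu13 state regardless of the earlier install order.
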
Verified on an H200 devbox:
1. TypeError fix: forced bad state, ran force_reinstall_cutlass_dsl_libs_cu13
-> smoke test went FAIL -> PASS, .so md5 changed from base's to cu13's.
2. LoRA regression check: ran test_lora_qwen3_8b_logprob_diff.py
-> both subtests passed, KL divergence 2.8e-4 (threshold 5e-3).
The fix does NOT re-trigger the CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS
regression from sgl-project#25743.
de055b3 to 1a0dbf2
Code Review
This pull request updates the nvidia-cutlass-dsl[cu13] dependency to version 4.5.1 and adds a force_reinstall_cutlass_dsl_libs_cu13 function to the CI installation script to prevent library mismatches. Feedback was provided to use the ${REPO_ROOT} variable for the pyproject.toml file path in the script to ensure it is correctly located regardless of the current working directory.
    return
fi

CUTLASS_DSL_VERSION=$(grep -Po -m1 'nvidia-cutlass-dsl(\[[^]]+\])?==\K[0-9A-Za-z\.\-]+' python/pyproject.toml || echo "")
Using a relative path for python/pyproject.toml makes the script's behavior dependent on the current working directory. Since REPO_ROOT is already defined and used elsewhere in this script for robustness, it should be used here as well. Additionally, quoting the path is a good practice to handle potential spaces in the directory name.
- CUTLASS_DSL_VERSION=$(grep -Po -m1 'nvidia-cutlass-dsl(\[[^]]+\])?==\K[0-9A-Za-z\.\-]+' python/pyproject.toml || echo "")
+ CUTLASS_DSL_VERSION=$(grep -Po -m1 'nvidia-cutlass-dsl(\[[^]]+\])?==\K[0-9A-Za-z\.\-]+' "${REPO_ROOT}/python/pyproject.toml" || echo "")
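
The working-directory dependence can be reproduced in isolation. The sketch below uses the same grep pattern against a stand-in pyproject.toml; the temporary directories and the file contents are assumptions made for the demo, REPO_ROOT is set to a temp dir here rather than computed the way the real script computes it, and grep -P requires a grep built with PCRE support (GNU grep usually is).

```shell
#!/bin/sh
# Demo only: a temp dir stands in for the repository root (assumption).
REPO_ROOT="$(mktemp -d)"
mkdir -p "${REPO_ROOT}/python"
cat > "${REPO_ROOT}/python/pyproject.toml" <<'EOF'
[project]
dependencies = [
  "nvidia-cutlass-dsl[cu13]==4.5.1",
]
EOF

# Same pattern as in the script: optional [extra], then capture what follows "==".
PATTERN='nvidia-cutlass-dsl(\[[^]]+\])?==\K[0-9A-Za-z\.\-]+'

# Simulate launching the script from an unrelated working directory.
cd "$(mktemp -d)" || exit 1

# Relative path: the file is not found, grep fails, and the fallback yields "".
RELATIVE_VERSION=$(grep -Po -m1 "$PATTERN" python/pyproject.toml 2>/dev/null || echo "")

# Absolute path via REPO_ROOT: independent of the current directory.
ABSOLUTE_VERSION=$(grep -Po -m1 "$PATTERN" "${REPO_ROOT}/python/pyproject.toml" || echo "")

echo "relative: '${RELATIVE_VERSION}'"
echo "absolute: '${ABSOLUTE_VERSION}'"
```

Because of the `|| echo ""` fallback, the relative-path failure is silent: the variable is simply empty, which is why the path problem is easy to miss.
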
@mmangkad I think your suggestion is correct, thanks for sharing it!

/tag-and-rerun-ci
- Use "${REPO_ROOT}/python/pyproject.toml" instead of relative path so the
version probe doesn't depend on the working directory the script is
launched from (per gemini-code-assist review).
- Bump nvidia-cutlass-dsl[cu13] 4.5.0 -> 4.5.1 now that the wheel-mix
TypeError is mitigated by force_reinstall_cutlass_dsl_libs_cu13. This
re-applies sgl-project#25576 which was rolled back in sgl-project#25938 only because of the
install-order bug.
Yeah, that was the issue: the install order matters, not the version.
Root cause

nvidia-cutlass-dsl[cu13] has additive PyPI extras: installing it pulls in both nvidia-cutlass-dsl-libs-base AND nvidia-cutlass-dsl-libs-cu13. The two wheels ship intentionally-different content for the same paths:

| | -libs-base | -libs-cu13 |
| cutlass/_mlir/dialects/_gpu_ops_gen.py | super().__init__(self.build_generic(...)) (new-style single object) | super().__init__(OPERATION_NAME, REGIONS, ...) (old-style positional) |
| cutlass/_mlir/_mlir_libs/_cutlass_ir.cpython-310-x86_64-linux-gnu.so | (operation: object) | |

Each wheel's .py is paired with a .so that has the matching API. If install order leaves the .py from one wheel and the .so from the other (which can happen via uv's install ordering), you get the hard TypeError seen in CI. This surfaces at kernel-compile time on CU13 CI runners during eagle / lora tests that go through flashinfer.rmsnorm_cute → cute.compile.

Empirical evidence

Tested all 4 combinations on an H200 devbox by manually cp-ing wheel contents into site-packages:

| .py from | .so from | gpu.GPUModuleOp(StringAttr, loc=loc) |
| -libs-base | -libs-base | PASS |
| -libs-cu13 | -libs-cu13 | PASS |
| -libs-cu13 | -libs-base | FAIL (TypeError) |
| -libs-base | -libs-cu13 | PASS |

Three of four states work. Only the mismatched .py=cu13 + .so=base breaks.

Fix

After install_sglang completes (with possibly mismatched state), force-reinstall -libs-cu13 last to guarantee both .py and .so come from the same wheel (BOTH-cu13 state). The version is parsed from pyproject.toml to stay in sync. Skips for non-CU13 runners (only -libs-base installed there, no conflict possible).

Validation on devbox

1. TypeError fix: forced bad state, UV_LINK_MODE=copy (matches CI), ran force_reinstall_cutlass_dsl_libs_cu13: smoke test went FAIL → PASS, .so md5 changed from base's to cu13's.
2. LoRA regression check: ran test/registered/lora/test_lora_qwen3_8b_logprob_diff.py against the fix on the same devbox: both subtests passed, KL divergence 2.8e-4 (threshold 5e-3). The fix does NOT re-trigger the CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS regression from #25743 (Revert #25690 to unblock LoRA Qwen3-8B CUDA graph capture on main).

Related PRs / supersedes
🤖 Generated with Claude Code
CI States
Latest PR Test (Base): ❌ Run #26216901406
Latest PR Test (Extra): ❌ Run #26216901321