[ROCm] Hotpatch aiter gluon pa_mqa_logits 3D instr_shape for GLM-5 (Triton 3.5+) #26572
Add idempotent patch script (ROCm/aiter#2575) for older vendored aiter in ROCm images: base _gluon_deepgemm_fp8_paged_mqa_logits used 2D MFMA instr_shape; GLM-5 needs 3D when _Use_2d_instr_shape_mfma_layout is false. Apply at docker build and AMD CI. Document GLM-5 ROCm env vars (SGLANG_ROCM_FUSED_DECODE_MLA=0, quick reduce, safetensors fast GPU). Co-authored-by: Cursor <cursoragent@cursor.com>
Code Review
This pull request introduces a hotpatch script (patch_aiter_gluon_pa_mqa_logits.py) to update aiter's gluon/pa_mqa_logits.py for Triton 3.5+ compatibility on ROCm, applying it in both the Dockerfile and CI dependency installation scripts. It also updates the GLM-5 deployment documentation and command generator to include AMD-specific environment variables. A review comment suggests using a with statement when reading the target file in the hotpatch script to ensure proper resource management.
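A minimal sketch of how such an idempotent text patch might work, assuming the fix is a plain string substitution. The `OLD` and `NEW` snippets, the `patch_source` helper, and the `make_layout` call in the usage lines are hypothetical stand-ins; the exact source text in aiter's `pa_mqa_logits.py` is not quoted in this PR.

```python
# Hypothetical patterns: illustrative stand-ins, not the real aiter source.
OLD = "instr_shape=[16, 16],"
NEW = (
    "instr_shape=([16, 16] if _Use_2d_instr_shape_mfma_layout "
    "else [16, 16, 32]),"
)


def patch_source(src: str) -> tuple[str, bool]:
    """Return (patched_source, changed).

    The patch is idempotent because NEW does not contain OLD: once every
    occurrence has been replaced, a second run finds nothing to change.
    """
    if OLD not in src:
        # Already patched, or upstream already carries the fix.
        return src, False
    return src.replace(OLD, NEW), True


# Usage: the first run patches, the second run is a no-op.
before = "layout = make_layout(instr_shape=[16, 16], transposed=True)\n"
once, changed_first = patch_source(before)
twice, changed_second = patch_source(once)
```

Matching on the old text alone, rather than on the presence of the new text, keeps the patch safe for files where other kernels already use the conditional form.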
        print(f"[aiter-hotpatch] {target} not found, skipping")
        return False

    src = open(target, encoding="utf-8").read()
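The review suggestion can be sketched as follows. `read_target` is a hypothetical helper, not a function from the PR's script, and it returns `None` for a missing file where the quoted code returns `False`.

```python
from pathlib import Path
from typing import Optional


def read_target(target: Path) -> Optional[str]:
    """Read the file to patch, or return None when it does not exist."""
    if not target.is_file():
        print(f"[aiter-hotpatch] {target} not found, skipping")
        return None
    # The context manager closes the file handle even if read() raises,
    # instead of relying on garbage collection to close it.
    with open(target, encoding="utf-8") as f:
        return f.read()
```

For a short-lived script the difference is small, but the `with` form makes the handle's lifetime explicit and avoids `ResourceWarning` noise when warnings are enabled.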
Hi @ChangLiu0709,
So the gluon path may only become reachable after the fix is in aiter.
Background
While adding GLM-5 FP8 on MI355X with SGLang in InferenceX (#1572), serving fails due to a bug in the NSA attention gluon pa_mqa_logits kernel (#26533). See Summary below for root cause and this PR's fix.
Summary
GLM-5 FP8 on MI355X (DSA / NSA attention) can fail when the vendored aiter in ROCm images predates ROCm/aiter#2575: the base _gluon_deepgemm_fp8_paged_mqa_logits kernel hardcoded 2D instr_shape=[16, 16] while Triton ≥ 3.5 requires 3D [16, 16, 32] when _Use_2d_instr_shape_mfma_layout is false (same conditional already used in the preshuffle variants).

This PR ports the idempotent hotpatch used in InferenceX disagg benchmarks into the SGLang repo:
- scripts/ci/amd/patch_aiter_gluon_pa_mqa_logits.py: shared patch script (no-op when aiter already includes ROCm/aiter#2575)
- docker/rocm.Dockerfile: run patch after aiter checkout (before setup.py build)
- scripts/ci/amd/amd_ci_install_dependency.sh: run patch on /sgl-workspace/aiter in CI (covers pre-installed and rebuilt aiter)
- GLM-5 deployment documentation and command generator: AMD-specific env vars (SGLANG_ROCM_FUSED_DECODE_MLA=0, ROCM_QUICK_REDUCE_QUANTIZATION=INT4, SAFETENSORS_FAST_GPU=1)

Not in scope
- Current aiter (46e6c92) already includes ROCm/aiter#2575. We only add an idempotent hotpatch for older vendored aiter.
- Once lmsysorg/sglang-rocm images are rebuilt with aiter that includes ROCm/aiter#2575, the runtime/build-time patch is a no-op and InferenceX can drop the equivalent setup_deps.sh gluon patch (#1572).
- glm_moe_dsa pip install: Handled in InferenceX for Mori images; SGLang uses in-tree GlmMoeDsaForCausalLM and transformers==5.8.1.

Co-authors
@ChangLiu0709
@chunfangamd
Test plan
- python3 scripts/ci/amd/patch_aiter_gluon_pa_mqa_logits.py on aiter checkout before ROCm/aiter#2575 → patch applies once; second run is no-op
- python3 scripts/ci/amd/patch_aiter_gluon_pa_mqa_logits.py on current aiter 46e6c92 → warns or no-op (pattern absent)
- test_glm5_perf_mi35x.py) with rebuilt ROCm image
- zai-org/GLM-5-FP8 serve on MI355X with documented ROCm env vars

Related
- setup_deps.sh until this lands in images)
- a1bdcec: upstream fix for gluon pa_mqa_logits MFMA instr_shape
CI States
Latest PR Test (Base): ✅ Run #26572635192
Latest PR Test (Extra): ❌ Run #26572635058