[ROCm] Hotpatch aiter gluon pa_mqa_logits 3D instr_shape for GLM-5 (Triton 3.5+)#26572

Open
ChangLiu0709 wants to merge 1 commit into
sgl-project:mainfrom
ChangLiu0709:chang/glm5-rocm-gluon-pa-mqa-instr-shape


@ChangLiu0709 ChangLiu0709 commented May 28, 2026

Background

While adding GLM-5 FP8 on MI355X with SGLang in InferenceX (#1572), serving fails due to a bug in the NSA attention gluon pa_mqa_logits kernel (#26533). See Summary below for root cause and this PR's fix.

Summary

GLM-5 FP8 on MI355X (DSA / NSA attention) can fail when the vendored aiter in ROCm images predates ROCm/aiter#2575. Before that fix, the base _gluon_deepgemm_fp8_paged_mqa_logits kernel hardcoded a 2D instr_shape=[16, 16], whereas Triton ≥ 3.5 requires the 3D form [16, 16, 32] when _Use_2d_instr_shape_mfma_layout is false. The preshuffle variants already select the shape with this same conditional.
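The shape selection described above can be sketched in plain Python. This is only an illustration of the conditional, not the aiter kernel code: the function name `mfma_instr_shape` and its boolean parameter are hypothetical stand-ins for the `_Use_2d_instr_shape_mfma_layout` flag, and how that flag is derived from the Triton version is not shown here.

```python
def mfma_instr_shape(use_2d_layout: bool) -> list[int]:
    """Pick the MFMA instr_shape for the given layout mode.

    `use_2d_layout` is a hypothetical stand-in for aiter's
    `_Use_2d_instr_shape_mfma_layout` flag.
    """
    # 2D form when the 2D layout is in use; otherwise the 3D form
    # that the PR description says Triton >= 3.5 requires.
    return [16, 16] if use_2d_layout else [16, 16, 32]


if __name__ == "__main__":
    print(mfma_instr_shape(True))
    print(mfma_instr_shape(False))
```

The bug described in this PR corresponds to always returning the 2D form regardless of the flag.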

This PR ports the idempotent hotpatch used in InferenceX disagg benchmarks into the SGLang repo:

  • scripts/ci/amd/patch_aiter_gluon_pa_mqa_logits.py — shared patch script (no-op when aiter already includes ROCm/aiter#2575)
  • docker/rocm.Dockerfile — run patch after aiter checkout (before setup.py build)
  • scripts/ci/amd/amd_ci_install_dependency.sh — run patch on /sgl-workspace/aiter in CI (covers pre-installed and rebuilt aiter)
  • GLM-5 docs — document ROCm env vars used in production (SGLANG_ROCM_FUSED_DECODE_MLA=0, ROCM_QUICK_REDUCE_QUANTIZATION=INT4, SAFETENSORS_FAST_GPU=1)
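For readers unfamiliar with the pattern, an idempotent source hotpatch of this kind might look roughly like the sketch below. It is not the script in this PR: the `OLD` and `NEW` strings are illustrative assumptions about how the literal appears in the aiter source, and `patch_file` is a hypothetical name. Only the log prefix and the "not found, skipping" behavior are taken from the snippet quoted in the review.

```python
from pathlib import Path

# Assumption: illustrative literals only. The real script's match
# pattern and replacement text are not shown in this PR description.
OLD = "instr_shape=[16, 16],"
NEW = (
    "instr_shape=([16, 16] if _Use_2d_instr_shape_mfma_layout "
    "else [16, 16, 32]),"
)


def patch_file(target: Path) -> bool:
    """Apply the replacement once; return True only if the file changed."""
    if not target.exists():
        print(f"[aiter-hotpatch] {target} not found, skipping")
        return False

    with open(target, encoding="utf-8") as f:
        src = f.read()

    # Idempotency: NEW does not contain OLD, so a patched file
    # (or one that never had the pattern) is left untouched.
    if OLD not in src:
        print(f"[aiter-hotpatch] {target} pattern absent, nothing to do")
        return False

    target.write_text(src.replace(OLD, NEW), encoding="utf-8")
    print(f"[aiter-hotpatch] patched {target}")
    return True
```

Because the replacement text no longer matches the search text, running the function twice on the same file changes it at most once, which is the property the test plan checks.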

Not in scope

  • AITER_COMMIT bump in this PR: The Dockerfile default (46e6c92) already includes ROCm/aiter#2575. We only add an idempotent hotpatch for older vendored aiter.
  • When the hotpatch is unnecessary: Once lmsysorg/sglang-rocm images are rebuilt with aiter that includes ROCm/aiter#2575, the runtime/build-time patch is a no-op and InferenceX can drop the equivalent setup_deps.sh gluon patch (#1572).
  • Transformers glm_moe_dsa pip install: Handled in InferenceX for Mori images; SGLang uses in-tree GlmMoeDsaForCausalLM and transformers==5.8.1.

Co-authors

@ChangLiu0709
@chunfangamd

Test plan

  • python3 scripts/ci/amd/patch_aiter_gluon_pa_mqa_logits.py on aiter checkout before ROCm/aiter#2575 → patch applies once; second run is no-op
  • python3 scripts/ci/amd/patch_aiter_gluon_pa_mqa_logits.py on current aiter 46e6c92 → warns or no-op (pattern absent)
  • GLM-5 MI35x perf/accuracy CI (test_glm5_perf_mi35x.py) with rebuilt ROCm image
  • Manual: zai-org/GLM-5-FP8 serve on MI355X with documented ROCm env vars

Related

  • InferenceX PR #1572 — GLM-5 FP8 MI355X disagg CI (uses setup_deps.sh until this lands in images)
  • ROCm/aiter a1bdcec — upstream fix for gluon pa_mqa_logits MFMA instr_shape

Made with Cursor


CI States

Latest PR Test (Base): ✅ Run #26572635192
Latest PR Test (Extra): ❌ Run #26572635058

Add idempotent patch script (ROCm/aiter#2575) for older vendored aiter in
ROCm images: base _gluon_deepgemm_fp8_paged_mqa_logits used 2D MFMA
instr_shape; GLM-5 needs 3D when _Use_2d_instr_shape_mfma_layout is false.

Apply at docker build and AMD CI. Document GLM-5 ROCm env vars
(SGLANG_ROCM_FUSED_DECODE_MLA=0, quick reduce, safetensors fast GPU).

Co-authored-by: Cursor <cursoragent@cursor.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a hotpatch script (patch_aiter_gluon_pa_mqa_logits.py) to update aiter's gluon/pa_mqa_logits.py for Triton 3.5+ compatibility on ROCm, applying it in both the Dockerfile and CI dependency installation scripts. It also updates the GLM-5 deployment documentation and command generator to include AMD-specific environment variables. A review comment suggests using a with statement when reading the target file in the hotpatch script to ensure proper resource management.

print(f"[aiter-hotpatch] {target} not found, skipping")
return False

src = open(target, encoding="utf-8").read()
medium

It is recommended to use a with statement when opening files to ensure that file descriptors are closed properly and promptly, rather than relying on garbage collection.

Suggested change
- src = open(target, encoding="utf-8").read()
+ with open(target, encoding="utf-8") as f:
+     src = f.read()

@ChangLiu0709 ChangLiu0709 marked this pull request as draft May 28, 2026 11:46
@ChangLiu0709 ChangLiu0709 marked this pull request as ready for review May 28, 2026 15:12
@1am9trash

Hi @ChangLiu0709,
Thanks for the patch. After tracing through the timeline, the dependency order on main looks like it already guarantees the bug can't happen:

  • aiter #2575 (2026-04-03) — fixes the instr_shape
  • sglang #22264 (2026-04-11) — bumps aiter to v0.1.12.post1 (includes PR#2575)
  • sglang #22657 (2026-04-13) — only then removes the if False: guard and enables the gluon kernel, explicitly citing PR#2575

So the gluon path may only become reachable after the fix is in aiter.
Could you share under what conditions this bug actually triggers?
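One way to check whether a given aiter checkout is still affected, sketched under assumptions: scan it for the hardcoded 2D literal. The helper name `find_unpatched` and the regular expression are hypothetical, and the check assumes the literal appears in the source as `instr_shape=[16, 16]`. It would report a false positive if a fixed file wrote the conditional without parentheses around the 2D list.

```python
import re
from pathlib import Path

# Assumption: matches a bare 2D literal such as `instr_shape=[16, 16]`
# but not the 3D form `[16, 16, 32]`.
HARDCODED_2D = re.compile(r"instr_shape\s*=\s*\[\s*16\s*,\s*16\s*\]")


def find_unpatched(aiter_root: str) -> list[Path]:
    """Return pa_mqa_logits.py files under aiter_root with the 2D literal."""
    hits = []
    for path in sorted(Path(aiter_root).rglob("pa_mqa_logits.py")):
        if HARDCODED_2D.search(path.read_text(encoding="utf-8")):
            hits.append(path)
    return hits
```

Running it against /sgl-workspace/aiter inside an image, for example `find_unpatched("/sgl-workspace/aiter")`, would return an empty list if no hardcoded 2D literal remains.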

Labels

amd, documentation
