gpt-oss-fp4-mi355x: pin to v0.19 + switch to AITER-env-based recipe #1531
xiaohuguo2023 wants to merge 6 commits into
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first-class so that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that, after merging, all GitHub Actions jobs fully pass. Often, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review and get approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
@claude /sweep test-config --config-files .github/configs/amd-master.yaml --config-keys gptoss-fp4-mi355x-vllm
Claude finished @chunfangamd's task in 0s. I'll analyze this and get back to you.
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26173755971
can u instead fix it in vllm upstream & switch to using nightly and then switch to that? what are the in progress PRs to fix it in vllm & rocm?
regression from ROCm 7.0201 → 7.0202
regression from v0.19 -> v0.21.0
fuse_allreduce_rms pass (recoverable via --compilation-config)
add it to recipes? https://recipes.vllm.ai/openai/gpt-oss-120b?hardware=mi355x
@functionstackx There seems to be a misunderstanding. Looking at the script changes, this PR is not intended to address the regression; that fix is separate and still in progress. @xiaohuguo2023 based his work on v0.19.0 and also applied it to the latest image. However, the performance gain on the latest version is lower than on v0.19.0, which was expected. Absolutely, we should update the recipe page.
@xiaohuguo2023, could you please double-check the performance gain? It doesn't seem to align with the local test results.
what are the vLLM PRs that fix this in upstream HEAD?
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26186157722
Force-pushed from 88e13ec to 1b98226.
/sweep test-config --config-files .github/configs/amd-master.yaml --config-keys gptoss-fp4-mi355x-vllm
@chunfangamd Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26227187638
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26227015172
/sweep test-config --config-files .github/configs/amd-master.yaml --config-keys gptoss-fp4-mi355x-vllm
@xiaohuguo2023 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26358364428
/sweep test-config --config-files .github/configs/amd-master.yaml --config-keys gptoss-fp4-mi355x-vllm
Claude finished @chunfangamd's task in 1m 33s.
PR Review: gpt-oss-fp4-mi355x pin to v0.19 + AITER recipe
Summary
One blocking issue found: the new … Everything else looks good: …
@chunfangamd Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26387067776
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 6b98c26.
LGTM — benchmark recipe update (image pin + AITER env vars + sweep-space expansion); only finding is a nit on the warmup-comment wording.
Extended reasoning...
Overview
This PR is scoped to the gpt-oss-fp4-mi355x-vllm benchmark recipe: it pins the vLLM image back from v0.21.0 to v0.19.0 (regression hedge), switches benchmarks/single_node/gptoss_fp4_mi355x.sh to an AITER-env-based recipe (MOE/RMSNORM/UnifiedAttn/A16W4 + HSA_NO_SCRATCH_RECLAIM=1), drops legacy launcher flags, adds --max-num-seqs 256 and --async-scheduling, adds a pre-flight warmup pass, and expands the search-space in .github/configs/amd-master.yaml (adds tp=2 and widens tp=4). A matching perf-changelog.yaml entry is included, and pr-link was correctly substituted to #1531 after my prior comment.
Security risks
None. Changes are confined to benchmark configuration, a launcher shell script, and a human-readable changelog. No auth, crypto, permissions, network, or user-input handling code is touched.
Level of scrutiny
Low. This is benchmark/CI infrastructure (the launcher is invoked by the sweep system, not production serving), and the substantive choices (image pin rationale, AITER flag set, sweep grid) are domain-specific tuning decisions that the AMD recipe owners have already iterated on in-thread — chunfangamd and xiaohuguo2023 have run multiple /sweep test-config runs against this branch.
Other factors
The only finding from the bug-hunting pass is a [Nit] inline comment noting the warmup-pass rationale mis-describes the underlying tool — purely a maintainer-readability concern, no functional impact. My earlier review's blocker (literal XXX placeholder in the pr-link) has been resolved in the current diff. Sweep runs have been kicked off by the recipe owners, which is the appropriate validation for this kind of change.
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26387052649
Pins the image back to vllm/vllm-openai-rocm:v0.19.0 (was bumped to v0.21.0 in #1406). v0.21 introduces a ROCm/AITER perf regression on MI355x for gpt-oss that we're still tracking down; staying on v0.19 in the meantime. Also rewrites the launcher to enable the AITER kernel paths via env vars (AITER MOE/RMSNorm/UnifiedAttn/A16W4 + HSA_NO_SCRATCH_RECLAIM=1) and drops the now-obsolete TRITON_ROPE/BUFFER_OPS/--attention-backend/fuse_rope_kvcache/use_inductor_graph_partition bits. Also adds --max-num-seqs 256 and --async-scheduling.
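For anyone reproducing the recipe outside CI, here is a minimal sketch of what the env-var block and serve flags in the commit message amount to. It is not the real gptoss_fp4_mi355x.sh. Assumptions: the per-op variable names are a literal expansion of the PR's MOE/RMSNORM/UNIFIED_ATTENTION/FUSED_MOE_A16W4 shorthand, VLLM_ROCM_USE_AITER=1 is an assumed master switch that the PR text does not list, and the helper names aiter_env/build_serve_cmd are hypothetical. The dropped flags (--attention-backend, fuse_rope_kvcache, etc.) are simply absent.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the AITER-env-based launcher recipe; not the real
# gptoss_fp4_mi355x.sh. Per-op names expand the PR's slash shorthand literally.
set -euo pipefail

aiter_env() {
  # Assumption: vLLM gates the per-op AITER flags behind this master switch;
  # the PR text does not list it explicitly.
  export VLLM_ROCM_USE_AITER=1
  export VLLM_ROCM_USE_AITER_MOE=1
  export VLLM_ROCM_USE_AITER_RMSNORM=1
  export VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=1
  export VLLM_ROCM_USE_AITER_FUSED_MOE_A16W4=1
  export HSA_NO_SCRATCH_RECLAIM=1
}

# Fill the global SERVE_CMD array; model name and TP size are caller inputs.
build_serve_cmd() {
  local model="$1" tp="$2"
  SERVE_CMD=(
    vllm serve "$model"
    --tensor-parallel-size "$tp"
    --max-num-seqs 256
    --async-scheduling
  )
}

# Dry-run usage (prints the command instead of launching it):
#   aiter_env
#   build_serve_cmd openai/gpt-oss-120b 4
#   printf '%q ' "${SERVE_CMD[@]}"; echo
```

Building the command as an array keeps each flag a separate argument, so a dry run can print or assert on it before anything launches.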
Force-pushed from 6b98c26 to e658541.
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26396409260 |
/sweep test-config --config-files .github/configs/amd-master.yaml --config-keys gptoss-fp4-mi355x-vllm
can u instead fix it in vllm upstream & switch to using nightly and then switch to that? what are the in progress PRs to fix it in vllm & rocm?
regression from ROCm 7.0201 → 7.0202
add it to recipes? https://recipes.vllm.ai/openai/gpt-oss-120b?hardware=mi355x

Pinning back to vllm 0.19.0 (was bumped to 0.21.0 in "Update gptoss-fp4-mi355x-vllm vLLM ROCm image to v0.21.0" #1406). vllm 0.21 hits a ~10% regression on MI355x gpt-oss-fp4, bisected to ~4% from the new ROCm fuse_allreduce_rms pass (recoverable via --compilation-config) and ~6% from the ROCm 7.0201 → 7.0202 patch in the v0.21 image (not launcher-fixable). Staying on 0.19 while we sort out both. Heads-up: Klaud Cold will try to re-bump on its next cron.
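As a sanity check on the bisect: if the two slowdowns compound multiplicatively, they come to 1 − 0.96 × 0.94 ≈ 9.8%, consistent with the ~10% total. For the ~4% that is recoverable on a v0.21 image, a hedged sketch of the --compilation-config route follows. The pass_config.fuse_allreduce_rms key path is an assumption and should be checked against the PassConfig fields of the exact vLLM build; both helper names are hypothetical.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Combine two percentage slowdowns, assuming they compound multiplicatively.
combined_regression_pct() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f", (1 - (1 - a / 100) * (1 - b / 100)) * 100 }'
}

# Hypothetical JSON for vLLM's --compilation-config that turns the
# fuse_allreduce_rms pass off. The key path is an assumption; verify it
# against the PassConfig of the vLLM version in use.
compilation_config_without_ar_rms() {
  printf '%s' '{"pass_config": {"fuse_allreduce_rms": false}}'
}

# Usage on a v0.21 image (not needed on the pinned v0.19.0 image):
#   vllm serve openai/gpt-oss-120b \
#     --compilation-config "$(compilation_config_without_ar_rms)"
```

This only addresses the compile-pass share; the ROCm 7.0201 → 7.0202 share lives in the image and cannot be undone from the launcher.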
Also reworked the launcher: dropped AMDGCN_USE_BUFFER_OPS, VLLM_ROCM_USE_AITER_TRITON_ROPE, the explicit --attention-backend, fuse_rope_kvcache, and use_inductor_graph_partition. Added the AITER env vars (VLLM_ROCM_USE_AITER_MOE/RMSNORM/UNIFIED_ATTENTION/FUSED_MOE_A16W4 + HSA_NO_SCRATCH_RECLAIM=1), plus --max-num-seqs 256 and --async-scheduling.
Local numbers (single MI355x, v0.19.0 image, 3 runs each, median):
Avg +30.3% across all 20 combos vs the 2026-03-09 dashboard. TP=4 sees the biggest wins (~100%+).
Note
Low Risk
Benchmark-only image, env, sweep, and launcher changes for one MI355X vLLM config; no production app or security surface.
Overview
Pins gptoss-fp4-mi355x-vllm back to vllm/vllm-openai-rocm:v0.19.0 (reverting the v0.21.0 bump) and widens the fixed-seq-len sweep with TP=2 and higher TP=4 concurrency ranges on 1k/1k and 8k/1k.
The MI355X launcher (gptoss_fp4_mi355x.sh) moves to an AITER env-var recipe (MOE, RMSNorm, unified attention, fused MoE A16W4, HSA_NO_SCRATCH_RECLAIM), drops the old MEC-firmware gate, buffer-ops, explicit attention backend, and fuse_rope_kvcache compile flags, and adds --max-num-seqs 256 and --async-scheduling.
A discarded pre-flight warmup (CONC×5 prompts at full concurrency) runs before the measured benchmark so that AITER Triton autotuning on a cold cache does not cause CI to under-report throughput. perf-changelog.yaml documents the change.
Reviewed by Cursor Bugbot for commit e658541.
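The warmup/measure split can be sketched as follows. run_benchmark is a hypothetical stub standing in for whatever benchmark client the launcher actually calls, and the CONC×5 sizing comes from the overview; the point is that the first pass's output is thrown away, so only the warm-cache run is recorded.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a discarded pre-flight warmup followed by the measured run.
set -euo pipefail

# Stub standing in for the real benchmark client; it just echoes its inputs.
run_benchmark() {
  local conc="$1" prompts="$2"
  echo "conc=${conc} prompts=${prompts}"
}

# Warmup size per the PR overview: CONC x 5 prompts at the same (full) concurrency.
warmup_prompts() {
  echo $(( $1 * 5 ))
}

warmup_then_measure() {
  local conc="$1" num_prompts="$2"
  # Discarded pass: only its side effects (e.g. warmed autotune caches) matter.
  run_benchmark "$conc" "$(warmup_prompts "$conc")" > /dev/null
  # Measured pass: the only output the caller records.
  run_benchmark "$conc" "$num_prompts"
}
```

The stub lets the control flow be checked without a GPU; a real launcher would swap in its client invocation and keep the same discard-then-measure shape.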