
chore(ci): migrate ROCm matrix from 7.1 to 7.2 #3267

Open
yuxinwan-amd wants to merge 12 commits into pytorch:main from yuxinwan-amd:chore/rocm-7.2-ci


@yuxinwan-amd


Summary

Migrate TorchTitan ROCm CI matrix defaults from 7.1 to 7.2 so ROCm jobs install PyTorch nightly wheels from the rocm7.2 channel.

Motivation
ROCm CI matrix generation is currently pinned to 7.1. Moving to 7.2 aligns CI with the intended ROCm wheel channel and avoids channel drift.

What changed

  • Updated gpu-arch-version in torchtitan/.github/workflows/set-matrix.yaml:
    7.1 -> 7.2

  • Updated index-url in torchtitan/.github/workflows/set-matrix.yaml:
    https://download.pytorch.org/whl/nightly/rocm7.1 -> https://download.pytorch.org/whl/nightly/rocm7.2

Scope

  • Only CI matrix generation in torchtitan/.github/workflows/set-matrix.yaml.

  • No model code, trainer code, distributed code, or test baseline changes.

  • No CUDA matrix changes.

Expected impact

  • ROCm matrix jobs install from the rocm7.2 nightly index.

  • Workflows that consume matrix.index-url inherit this update automatically.

Risk assessment

  • Low code risk: workflow config-only change.

  • Medium environment risk: ROCm 7.2 nightly package availability/compatibility could surface CI-only failures.

Validation plan

  • Confirm generated matrix in workflow logs includes:

gpu-arch-version: 7.2
index-url: https://download.pytorch.org/whl/nightly/rocm7.2

  • Verify ROCm install steps complete successfully.

  • Verify CUDA jobs are unchanged.
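
The first validation step could be scripted roughly as follows. This is a minimal sketch, not part of the PR: it assumes the generated matrix is available as a JSON string with a top-level "include" list (a hypothetical shape), and it identifies ROCm entries by "/rocm" appearing in index-url. The function name check_rocm_entries is illustrative only.

```python
import json

# Values the PR description says should appear in the generated matrix.
EXPECTED_VERSION = "7.2"
EXPECTED_INDEX_URL = "https://download.pytorch.org/whl/nightly/rocm7.2"


def check_rocm_entries(matrix_json: str) -> list[str]:
    """Return human-readable problems found in the ROCm matrix entries.

    Assumption: the matrix is a JSON object with an "include" list of
    entries; the real workflow output may be shaped differently.
    """
    matrix = json.loads(matrix_json)
    rocm_entries = [
        entry
        for entry in matrix.get("include", [])
        if "/rocm" in entry.get("index-url", "")
    ]
    problems = []
    if not rocm_entries:
        problems.append("no ROCm entries found in matrix")
    for entry in rocm_entries:
        version = entry.get("gpu-arch-version")
        url = entry.get("index-url")
        if version != EXPECTED_VERSION:
            problems.append(f"unexpected gpu-arch-version: {version!r}")
        if url != EXPECTED_INDEX_URL:
            problems.append(f"unexpected index-url: {url!r}")
    return problems
```

An empty return value means every ROCm entry carries both expected values; anything else lists what to look at in the workflow logs.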

Rollback plan
If regressions appear, revert the two lines in:

torchtitan/.github/workflows/set-matrix.yaml
torchtitan/.github/workflows/set-matrix.yaml

What changed:
Updated ROCm matrix architecture version from 7.1 to 7.2 in torchtitan/.github/workflows/set-matrix.yaml.
Updated ROCm nightly wheel index from rocm7.1 to rocm7.2 in torchtitan/.github/workflows/set-matrix.yaml.

Scope:
CI workflow matrix generation only.
No runtime training logic changes.
No CUDA matrix changes.

Expected impact:
ROCm jobs in GitHub Actions will install PyTorch nightly wheels from the ROCm 7.2 channel.
ROCm-tagged matrix entries will report gpu-arch-version 7.2.
@meta-cla meta-cla Bot added the CLA Signed label May 7, 2026
@pytorch-bot

pytorch-bot Bot commented May 7, 2026


Unknown label ciflow/rocm.
Currently recognized labels are

  • ciflow/8gpu
  • ciflow/h100.8
  • ciflow/rl

@pytorch-bot

pytorch-bot Bot commented May 7, 2026


The following ciflow label(s) have been added but CI has not been triggered yet because the workflows are awaiting approval:

  • ciflow/rocm
  • ciflow/8gpu

Once a maintainer approves the workflows (scroll to the bottom of the PR page), the corresponding CI jobs will be triggered automatically. Please ping one of the reviewers if you do not have access to approve and run workflows.

@yuxinwan-amd

Author

@claude review this PR

@yuxinwan-amd yuxinwan-amd marked this pull request as ready for review June 9, 2026 15:12
@tianyu-l

tianyu-l commented Jun 9, 2026

Contributor

@claude review this PR

@claude

claude Bot commented Jun 9, 2026


Claude finished @tianyu-l's task in 58s


PR Review: migrate ROCm matrix from 7.1 → 7.2

  • Read changed file (.github/workflows/set-matrix.yaml)
  • Verify diff scope matches PR description
  • Scan repo for other stale ROCm 7.1 references
  • Check docker base image / index-url consistency

Verdict: LGTM. This is a clean, low-risk config-only change. The actual diff is exactly two lines (both in the ROCm matrix block), matching the PR description:

-            "gpu-arch-version": "7.1",
+            "gpu-arch-version": "7.2",
-            "index-url": "https://download.pytorch.org/whl/nightly/rocm7.1",
+            "index-url": "https://download.pytorch.org/whl/nightly/rocm7.2",

gpu-arch-version and index-url are kept in sync (both now 7.2 / rocm7.2), which is the important correctness check here. The CUDA matrix is untouched, as intended.
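
The sync check described above can be expressed as a small predicate. This is a hedged sketch rather than anything the workflow actually runs: it assumes a matrix entry is a dict with the "gpu-arch-version" and "index-url" keys shown in the diff, and that the ROCm channel name is the last path segment of the URL. The helper name channel_matches_version is hypothetical.

```python
import re

# Matches a trailing ROCm channel segment such as ".../rocm7.2".
_ROCM_CHANNEL = re.compile(r"/rocm(\d+(?:\.\d+)*)/?$")


def channel_matches_version(entry: dict) -> bool:
    """True if the ROCm version in index-url equals gpu-arch-version."""
    match = _ROCM_CHANNEL.search(entry.get("index-url", ""))
    if match is None:
        # Not a ROCm wheel channel (or an unrecognized URL layout).
        return False
    return match.group(1) == entry.get("gpu-arch-version")
```

A check like this would catch a future bump that updates only one of the two fields.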

Findings

  1. No stale 7.1 references remain. I grepped the whole repo — the only ROCm wheel-channel references are the two lines this PR updates. Nothing else needs to move in lockstep. ✅

  2. Docker base image is unpinned (pre-existing, not a blocker). .ci/docker/build.sh:32 builds the ROCm image from rocm/dev-ubuntu-22.04:latest. The base image ROCm version is therefore decoupled from the 7.2 wheel channel selected here. As long as latest provides a ROCm runtime compatible with the rocm7.2 wheels, this is fine — and it's the same situation that existed under 7.1. Just worth being aware that the wheel channel is pinned while the container runtime floats; if the rocm7.2 wheels need a runtime newer than whatever latest resolves to, that mismatch would surface as a CI-only failure. No change requested.

  3. Validation is environment-gated. As the PR notes, the real risk is ROCm 7.2 nightly wheel availability/compatibility, not the YAML. The matrix only includes ROCm on the ciflow/8gpu tag/label path or on push-to-main / schedule (set-matrix.yaml:107-125), so a normal PR run won't exercise it. Please confirm a labeled run actually installs from rocm7.2 and the ROCm install + integration steps pass before merge.
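
For anyone who wants to repeat the stale-reference scan from finding 1 locally, a rough Python equivalent of a repo-wide grep is sketched below. It is not the command used in this review; the function name find_stale_references and the default needle "rocm7.1" are illustrative assumptions.

```python
from pathlib import Path


def find_stale_references(root, needle="rocm7.1"):
    """Return (relative_path, line_number) pairs for lines containing needle.

    Skips the .git directory and any file that cannot be read as UTF-8.
    """
    hits = []
    for path in sorted(Path(root).rglob("*")):
        relative = path.relative_to(root)
        if not path.is_file() or ".git" in relative.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # binary or unreadable file
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((str(relative), lineno))
    return hits
```

Run against a checkout of this branch, an empty list would be consistent with finding 1.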

Minor note (non-blocking)

The PR body's rollback plan lists set-matrix.yaml twice — it's a single file. Cosmetic only.

Process note

This branch can't be diffed against origin/main directly (no shared merge base in this shallow checkout); I reviewed via the migration commit f203e55 and the current file state, which agree.

Nice, tightly-scoped change. 👍
· branch chore/rocm-7.2-ci

@tianyu-l tianyu-l left a comment

Contributor


let me know if test finishes and PR ready to merge

@yuxinwan-amd

Author

let me know if test finishes and PR ready to merge

The feature test was failing due to a loss baseline mismatch. I refreshed llama3_rocm_mi350x.txt accordingly in 117df38. Could you please re-enable the CI? If it comes back green, this PR should be ready to merge.


Labels

ciflow/rocm, ciflow/8gpu, CLA Signed, module: rocm


2 participants