chore(ci): migrate ROCm matrix from 7.1 to 7.2 #3267
What changed
Updated ROCm matrix architecture version from 7.1 to 7.2 in torchtitan/.github/workflows/set-matrix.yaml.
Updated ROCm nightly wheel index from rocm7.1 to rocm7.2 in torchtitan/.github/workflows/set-matrix.yaml.
Scope
CI workflow matrix generation only.
No runtime training logic changes.
No CUDA matrix changes.
Expected impact
ROCm jobs in GitHub Actions will install PyTorch nightly wheels from the ROCm 7.2 channel.
ROCm-tagged matrix entries will report gpu-arch-version 7.2.
The following ciflow label(s) have been added but CI has not been triggered yet because the workflows are awaiting approval:
Once a maintainer approves the workflows (scroll to the bottom of the PR page), the corresponding CI jobs will be triggered automatically. Please ping one of the reviewers if you do not have access to approve and run workflows.
@claude review this PR
Export HIPBLASLT_LOG_LEVEL=5 and HIPBLASLT_LOG_MASK=0xFFFFFFFF for ROCm jobs
Set TENSILE_LOG_LEVEL=1 to increase Tensile diagnostics
Echo logging configuration for easier CI debugging
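For reference, the logging setup this commit describes could be sketched roughly as below. The variable names and values come from the commit message; the helper names and the idea of merging them into a job's environment mapping are assumptions made for illustration, not the workflow's actual implementation.

```python
# Values taken from the commit message; everything else here is hypothetical.
ROCM_DEBUG_ENV = {
    "HIPBLASLT_LOG_LEVEL": "5",
    "HIPBLASLT_LOG_MASK": "0xFFFFFFFF",
    "TENSILE_LOG_LEVEL": "1",
}


def with_rocm_logging(env: dict) -> dict:
    """Return a copy of env with the ROCm logging variables applied."""
    merged = dict(env)
    merged.update(ROCM_DEBUG_ENV)
    return merged


def echo_logging_config(env: dict) -> list:
    """Format the logging variables as lines suitable for a CI log."""
    return [f"{name}={env[name]}" for name in ROCM_DEBUG_ENV]


job_env = with_rocm_logging({"PATH": "/usr/bin"})
for line in echo_logging_config(job_env):
    print(line)
```

Echoing the resolved values at the start of a job makes it easy to confirm from the log alone which diagnostics were active when a failure occurred.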
@claude review this PR
Claude finished @tianyu-l's task in 58s
PR Review: migrate ROCm matrix from 7.1 → 7.2
Verdict: LGTM. This is a clean, low-risk config-only change. The actual diff is exactly two lines (both in the ROCm matrix block), matching the PR description:
- "gpu-arch-version": "7.1",
+ "gpu-arch-version": "7.2",
- "index-url": "https://download.pytorch.org/whl/nightly/rocm7.1",
+ "index-url": "https://download.pytorch.org/whl/nightly/rocm7.2",
Findings
Minor note (non-blocking)
The PR body's rollback plan lists
Process note
This branch can't be diffed against
Nice, tightly-scoped change. 👍
tianyu-l left a comment
Let me know when the tests finish and the PR is ready to merge.
The feature test was failing due to a loss baseline mismatch. I refreshed llama3_rocm_mi350x.txt accordingly (117df38). Could you please re-enable the CI? If it comes back green, this PR should be ready to merge.
Summary
Migrate TorchTitan ROCm CI matrix defaults from 7.1 to 7.2 so ROCm jobs install PyTorch nightly wheels from the rocm7.2 channel.
Motivation
ROCm CI matrix generation is currently pinned to 7.1. Moving to 7.2 aligns CI with the intended ROCm wheel channel and avoids channel drift.
What changed
Updated gpu-arch-version in torchtitan/.github/workflows/set-matrix.yaml: 7.1 -> 7.2
Updated index-url in torchtitan/.github/workflows/set-matrix.yaml: https://download.pytorch.org/whl/nightly/rocm7.1 -> https://download.pytorch.org/whl/nightly/rocm7.2
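The two edits amount to bumping one version string that feeds both fields. A minimal sketch, assuming the ROCm entry behaves like a plain dictionary: the helper name `rocm_matrix_entry` is hypothetical, and the real set-matrix.yaml may carry additional keys that are not modeled here.

```python
# Hypothetical helper; only the two keys touched by this PR are modeled.
NIGHTLY_INDEX_BASE = "https://download.pytorch.org/whl/nightly"


def rocm_matrix_entry(rocm_version: str) -> dict:
    """Build the ROCm fields of a matrix entry from a single version string."""
    return {
        "gpu-arch-version": rocm_version,
        "index-url": f"{NIGHTLY_INDEX_BASE}/rocm{rocm_version}",
    }


before = rocm_matrix_entry("7.1")
after = rocm_matrix_entry("7.2")

# Print a diff-style view of every field whose value changed.
for key in before:
    if before[key] != after[key]:
        print(f'- "{key}": "{before[key]}",')
        print(f'+ "{key}": "{after[key]}",')
```

Deriving both fields from one value is only an illustration of why the two lines move together; the PR itself edits the literals directly.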
Scope
Only CI matrix generation in torchtitan/.github/workflows/set-matrix.yaml.
No model code, trainer code, distributed code, or test baseline changes.
No CUDA matrix changes.
Expected impact
ROCm matrix jobs install from the rocm7.2 nightly index.
Workflows that consume matrix.index-url inherit this update automatically.
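As a sketch of how a consuming job might pick up the new index: this assumes the job installs with `pip install --pre torch --index-url ...`, the form PyTorch documents for nightly wheels. The function name and the shape of the `matrix` dictionary are illustrative and not taken from the workflows themselves.

```python
import shlex


def nightly_install_command(matrix: dict, package: str = "torch") -> list:
    """Return the pip argv a job could run for a given matrix entry."""
    return [
        "pip", "install", "--pre", package,
        "--index-url", matrix["index-url"],
    ]


# Entry values as stated in this PR; any other keys are omitted.
rocm_entry = {
    "gpu-arch-version": "7.2",
    "index-url": "https://download.pytorch.org/whl/nightly/rocm7.2",
}

cmd = nightly_install_command(rocm_entry)
print(shlex.join(cmd))
```

Because the command reads the URL from the matrix entry rather than hard-coding it, a job written this way needs no edit of its own when the matrix changes.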
Risk assessment
Low code risk: workflow config-only change.
Medium environment risk: ROCm 7.2 nightly package availability/compatibility could surface CI-only failures.
Validation plan
Verify generated ROCm matrix entries report:
gpu-arch-version: 7.2
index-url: https://download.pytorch.org/whl/nightly/rocm7.2
Verify ROCm install steps complete successfully.
Verify CUDA jobs are unchanged.
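The matrix checks in this plan could be automated along these lines. This is a sketch under stated assumptions: the function names are hypothetical, matrix entries are treated as plain dictionaries, and the CUDA entry below is a placeholder rather than a real value from the workflow.

```python
# Expected ROCm values, as listed in the validation plan.
EXPECTED_ROCM = {
    "gpu-arch-version": "7.2",
    "index-url": "https://download.pytorch.org/whl/nightly/rocm7.2",
}


def check_rocm_entry(entry: dict) -> list:
    """Return human-readable mismatches; an empty list means the entry passes."""
    problems = []
    for key, expected in EXPECTED_ROCM.items():
        actual = entry.get(key)
        if actual != expected:
            problems.append(f"{key}: expected {expected!r}, got {actual!r}")
    return problems


def cuda_entries_unchanged(before: list, after: list) -> bool:
    """CUDA entries pass if they are identical before and after the change."""
    return before == after


good = dict(EXPECTED_ROCM)
stale = {
    "gpu-arch-version": "7.1",
    "index-url": "https://download.pytorch.org/whl/nightly/rocm7.1",
}
cuda_placeholder = [{"index-url": "placeholder-cuda-index"}]  # not a real value

print(check_rocm_entry(good))
for problem in check_rocm_entry(stale):
    print(problem)
print(cuda_entries_unchanged(cuda_placeholder, list(cuda_placeholder)))
```

Whether the ROCm install steps actually succeed still has to be confirmed from the CI run itself; this only covers the matrix contents.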
Rollback plan
If regressions appear, revert the two lines in: