Mi355 #455

Merged
msaroufim merged 20 commits into main from mi355
Mar 4, 2026

@msaroufim

Member

No description provided.

Mark Saroufim added 4 commits March 3, 2026 18:32
Add MI355X to GitHubGPU enum, GPU_TO_SM mapping, and github launcher
runner routing with runner label mia1-p02-g29.

Add container image ghcr.io/gpu-mode/amd-runner:main with GPU device
passthrough to amd_workflow.yml. Add numpy to AMD_REQUIREMENTS.

- Upgrade ROCm from 6.3.1 to 7.2
- Upgrade PyTorch to nightly rocm7.2
- Update aiter to latest commit (f3be04a) for recent FP4 kernel APIs
- Remove UCX, OpenMPI, and rocSHMEM builds (no longer needed)
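
The first commit in this group adds MI355X to the GitHubGPU enum, the GPU_TO_SM mapping, and the launcher's runner routing. A minimal sketch of how those three pieces might fit together follows. Only `GitHubGPU`, `GPU_TO_SM`, `MI355X`, and the runner label `mia1-p02-g29` come from the commit message; the `RUNNER_LABELS` table, the `runner_label_for` helper, and the `"gfx950"` value are assumptions made for illustration and may not match the actual code in src/libkernelbot.

```python
from enum import Enum


class GitHubGPU(Enum):
    # Only the member named in the commit message is shown here; the real
    # enum presumably lists other GPUs as well.
    MI355X = "MI355X"


# Assumption: the mapping stores an architecture string per GPU name.
# "gfx950" is a placeholder value, not taken from the PR.
GPU_TO_SM = {GitHubGPU.MI355X.name: "gfx950"}

# Hypothetical routing table from GPU to self-hosted runner label.
RUNNER_LABELS = {GitHubGPU.MI355X: "mia1-p02-g29"}


def runner_label_for(gpu: GitHubGPU) -> str:
    """Return the runner label a job for `gpu` should be routed to."""
    try:
        return RUNNER_LABELS[gpu]
    except KeyError:
        raise ValueError(f"no runner configured for {gpu.name}") from None


print(runner_label_for(GitHubGPU.MI355X))
```

Keeping the routing in a lookup table rather than an if/else chain means adding another GPU touches one line per table, which is roughly the shape of change the commit message describes.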

github-actions Bot commented Mar 4, 2026

Coverage report

Columns: File, Statements, Missing, Coverage, Coverage (new stmts), Lines missing.
Rows: src/libkernelbot/consts.py, src/libkernelbot/utils.py, Project Total.

This report was generated by python-coverage-comment-action

Mark Saroufim added 16 commits March 3, 2026 19:04
…GPU deps

- Upgrade ROCm from 6.3.1 to 7.1 (stable, matches host ROCm 7.0.1)
- Use stable torch 2.10.0+rocm7.1 instead of nightly
- Update aiter to latest commit (f3be04a) for recent FP4 kernel APIs
- Remove UCX, OpenMPI, and rocSHMEM builds
Fixes EACCES errors from root-owned files left by previous container runs.

- Replace python3.10 packages with python3 equivalents
- Use noble ROCm package instead of jammy
- Add --break-system-packages for pip on Noble
- Remove git-core PPA (not needed on Noble)
- Remove linux-headers install (not available during build)

Ensures the workflow timeout is at least 30 minutes to account for
Docker image pulls and container initialization on new runners.
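
The last commit body above describes a floor on the workflow timeout. A minimal sketch of that clamp follows. `DEFAULT_GITHUB_TIMEOUT_MINUTES` is the constant named in the commit title, and 30 is the value the message implies; the function name `effective_timeout_minutes` is hypothetical and the real launcher may apply the minimum differently.

```python
# Value inferred from "at least 30 minutes" in the commit message.
DEFAULT_GITHUB_TIMEOUT_MINUTES = 30


def effective_timeout_minutes(requested_minutes: float) -> float:
    """Never let the workflow timeout drop below the default.

    Hypothetical helper: short requested timeouts are raised to the floor so
    that Docker image pulls and container startup do not consume the budget.
    """
    return max(requested_minutes, DEFAULT_GITHUB_TIMEOUT_MINUTES)


print(effective_timeout_minutes(5))   # raised to the 30-minute floor
print(effective_timeout_minutes(45))  # longer requests pass through unchanged
```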
@msaroufim msaroufim merged commit ee0919e into main Mar 4, 2026
3 of 5 checks passed
SinatrasC pushed a commit to SinatrasC/kernelbot that referenced this pull request Jun 17, 2026
* Add MI355X GPU support for AMD GitHub runner

Add MI355X to GitHubGPU enum, GPU_TO_SM mapping, and github launcher
runner routing with runner label mia1-p02-g29.

* Use amd-runner Docker container for MI355X workflow

Add container image ghcr.io/gpu-mode/amd-runner:main with GPU device
passthrough to amd_workflow.yml. Add numpy to AMD_REQUIREMENTS.

* Update AMD Dockerfile: ROCm 7.2, latest aiter, remove multi-GPU deps

- Upgrade ROCm from 6.3.1 to 7.2
- Upgrade PyTorch to nightly rocm7.2
- Update aiter to latest commit (f3be04a) for recent FP4 kernel APIs
- Remove UCX, OpenMPI, and rocSHMEM builds (no longer needed)

* Update AMD_REQUIREMENTS to use ROCm 7.2 nightly index

* Fix container permissions: run as root for GitHub Actions compatibility

* Revert "Update AMD_REQUIREMENTS to use ROCm 7.2 nightly index"

This reverts commit 06c7ba3.

* Revert "Update AMD Dockerfile: ROCm 7.2, latest aiter, remove multi-GPU deps"

This reverts commit 7e5949c.

* Simplify AMD workflow for MI355X: use container deps, skip requirements install

* Reapply "Update AMD Dockerfile: ROCm 7.2, latest aiter, remove multi-GPU deps"

This reverts commit 6067a0c.

* Update AMD Dockerfile to ROCm 7.1 stable, latest aiter, remove multi-GPU deps

- Upgrade ROCm from 6.3.1 to 7.1 (stable, matches host ROCm 7.0.1)
- Use stable torch 2.10.0+rocm7.1 instead of nightly
- Update aiter to latest commit (f3be04a) for recent FP4 kernel APIs
- Remove UCX, OpenMPI, and rocSHMEM builds

* Use mia1-p02-g29 runner to build AMD Docker image

* Add workspace cleanup step before checkout in AMD Docker build

Fixes EACCES errors from root-owned files left by previous container runs.

* Remove workspace cleanup step from AMD Docker build

* Use GITHUB_TOKEN instead of PUBLISH_TOKEN for ghcr.io login

* Fix Dockerfile for Ubuntu 24.04 (Noble) base image

- Replace python3.10 packages with python3 equivalents
- Use noble ROCm package instead of jammy
- Add --break-system-packages for pip on Noble
- Remove git-core PPA (not needed on Noble)
- Remove linux-headers install (not available during build)

* Remove pip upgrade step (incompatible with Noble system pip)

* Use amd-runner:mi355 Docker image with working aiter + ROCm

* Fix pip install: add --break-system-packages for container environment

* Update amd-docker.Dockerfile

* Set minimum GitHub timeout to DEFAULT_GITHUB_TIMEOUT_MINUTES

Ensures the workflow timeout is at least 30 minutes to account for
Docker image pulls and container initialization on new runners.