Add manual GPU allocator reset workflow [skip ci] #478

Merged
mawad-amd merged 1 commit into main from infra/gpu-reset-tool
Mar 24, 2026

@mawad-amd
Collaborator

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Adds a workflow_dispatch action that resets the shared GPU bitmap on self-hosted
runners when allocations leak (e.g., a runner crashes without releasing its allocation).
Intended to live on this branch permanently; trigger it from the Actions UI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mawad-amd requested review from BKP and neoblizz as code owners March 24, 2026 03:30
Copilot AI review requested due to automatic review settings March 24, 2026 03:30
@mawad-amd merged commit 628c0b7 into main Mar 24, 2026
8 of 35 checks passed
@mawad-amd deleted the infra/gpu-reset-tool branch March 24, 2026 03:30
@github-actions bot added the in-progress ("We are working on it") and iris ("Iris project issue") labels Mar 24, 2026
Contributor

Copilot AI left a comment

Pull request overview

Adds a manually-triggered GitHub Actions workflow intended to reset the GPU allocator state on self-hosted AMDGPU runners to recover from leaked allocations.

Changes:

  • Introduces a workflow_dispatch workflow to reset the GPU allocator bitmap state file.
  • Prints allocator state before/after reset and forces the bitmap to “free”.
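
The reset described above (print the state, force the bitmap to free, print again) could look roughly like the sketch below. It is not the workflow's actual code: the state file path, its JSON layout (`{"bitmap": [...]}`), the GPU count, and the function names `read_state` and `reset_gpu_bitmap` are all assumptions made for illustration.

```python
import json
import os


def read_state(state_path):
    """Return the parsed allocator state, or None if missing or unreadable."""
    try:
        with open(state_path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return None


def reset_gpu_bitmap(state_path, num_gpus=8):
    """Print the state, mark every GPU slot free, then print the new state.

    Hypothetical format: a JSON object with a "bitmap" list, where
    0 means free and 1 means allocated. The real allocator may differ.
    """
    before = read_state(state_path)
    print("before reset:", before)

    fresh = {"bitmap": [0] * num_gpus}

    # Write to a temporary file and rename it over the original so a
    # reader never observes a half-written state file.
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(fresh, f)
    os.replace(tmp_path, state_path)

    after = read_state(state_path)
    print("after reset:", after)
    return before, after
```

A workflow step would call something like `reset_gpu_bitmap("/path/to/state.json")` on the runner. The sketch takes no lock, so it assumes no job is allocating GPUs while the reset runs.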
