
[controllers] feat: add Apple Silicon MPS support #70

Merged
Wangmerlyn merged 3 commits into main from feature/macm-support on Mar 9, 2026

Conversation


@Wangmerlyn (Owner) commented Mar 9, 2026

Summary

Add full support for Mac M series chips (M1/M2/M3/M4) using Metal Performance Shaders (MPS).

Changes

Core Implementation

  • MacMGPUController: New controller using torch.mps backend
  • Supports MPS device 0 only (Apple Silicon limitation)
  • Uses torch.mps.synchronize() and torch.mps.empty_cache() for memory management
  • busy_threshold parameter accepted for API compatibility but has no effect
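The controller's background keep-alive pattern can be sketched in pure Python. This is a simplified illustration, not the project's actual API: `KeepLoop` and its `allocate()` placeholder are hypothetical names, and the real controller would allocate an MPS tensor (e.g. via `torch.ones(n, device="mps")`) where `allocate()` is called.

```python
import threading
from typing import Optional


class KeepLoop:
    """Skeleton of a background keep-alive loop (illustrative only)."""

    def __init__(self, interval: float = 0.1) -> None:
        self.interval = interval
        self._stop_evt = threading.Event()
        self._thread: Optional[threading.Thread] = None
        self.ticks = 0  # counts loop iterations, for illustration only

    def allocate(self) -> None:
        # Placeholder for the real MPS tensor allocation; a no-op here.
        self.ticks += 1

    def _run(self) -> None:
        # Wake every `interval` seconds until stop() is called.
        while not self._stop_evt.wait(self.interval):
            self.allocate()

    def start(self) -> None:
        if self._thread is not None and self._thread.is_alive():
            return  # already running; mirrors the controller's early return
        self._stop_evt.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self) -> None:
        self._stop_evt.set()
        if self._thread is not None:
            self._thread.join()
```

The `threading.Event`-based stop signal is what lets `stop()` interrupt the loop mid-sleep instead of waiting out a full interval.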

Platform Support

  • Platform Detection: Add MACM platform detection
  • Detects macOS + arm64 + MPS availability
  • Detection order: CUDA → ROCm → Mac M → CPU
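The detection step can be sketched as below. Function names here are illustrative (the PR's actual helper is `_check_macm()` in `platform_manager.py`); `torch.backends.mps.is_available()` is the real PyTorch probe.

```python
import platform


def is_apple_silicon(system: str, machine: str) -> bool:
    """Pure platform check, kept separate from the torch probe for testability."""
    return system == "Darwin" and machine == "arm64"


def check_macm() -> bool:
    """Detect Apple Silicon with a usable MPS backend (hypothetical helper)."""
    if not is_apple_silicon(platform.system(), platform.machine()):
        return False
    try:
        import torch
    except ImportError:
        return False
    backend = getattr(torch.backends, "mps", None)
    return backend is not None and backend.is_available()
```

Running the pure check with injected values (e.g. `is_apple_silicon("Darwin", "arm64")`) lets the detection logic be unit-tested on any platform.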

Integration

  • GlobalGPUController: Integrate MacMGPUController for MACM platform
  • GPU Info: Add _query_macm() for Apple Silicon GPU info
  • Tests: Add tests/macm_controller/ with basic tests

Configuration

  • pyproject.toml: Add macm extras with psutil dependency
  • conftest.py: Add --run-macm option and macm_available fixture

Documentation

  • README.md: Add Mac M series installation instructions
  • docs/getting-started.md: Add Mac platform guide and MPS check
  • docs/concepts/architecture.md: Update with Mac M support
  • skills/SKILL.md: Add Mac M installation option and troubleshooting

Known Limitations

  • GPU Utilization Monitoring: Not available on macOS due to system API limitations. The busy_threshold parameter is accepted for API compatibility but has no effect.
  • Single Device: MPS only supports device 0.
  • Unified Memory: Apple Silicon uses unified memory architecture (shared with system RAM).

Testing

On an Apple Silicon Mac with PyTorch MPS support:

```bash
pip install -e ".[macm]"
pytest tests/macm_controller/ --run-macm -v
```

Installation

```bash
pip install torch
pip install "keep-gpu[macm]"
```
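Before installing, MPS availability can be confirmed from Python. This is a sketch: `mps_status` is a hypothetical helper wrapping the real `torch.backends.mps.is_available()` probe.

```python
def mps_status() -> str:
    """Report MPS backend availability; 'torch-missing' if torch is absent."""
    try:
        import torch
    except ImportError:
        return "torch-missing"
    # torch.backends.mps exists on PyTorch >= 1.12; guard for older builds.
    backend = getattr(torch.backends, "mps", None)
    if backend is None:
        return "unavailable"
    return "available" if backend.is_available() else "unavailable"


print("MPS:", mps_status())
```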

Summary by CodeRabbit

  • New Features

    • Added support for Apple Silicon Macs (M1/M2/M3/M4) using Metal Performance Shaders (MPS).
    • GPU memory management and keepalive functionality now available for macOS systems.
    • Platform detection automatically selects MPS backend on compatible devices.
  • Documentation

    • Expanded installation guides with Mac M series-specific setup instructions.
    • Added platform-specific limitations and requirements for Apple Silicon devices.

Add full support for Mac M series chips (M1/M2/M3/M4) using Metal Performance Shaders (MPS):

- MacMGPUController: new controller using torch.mps backend
- Platform detection: add MACM platform to platform_manager
- GlobalGPUController: integrate MacMGPUController for MACM platform
- GPU info: add _query_macm() for Apple Silicon GPU info
- Tests: add tests/macm_controller/ with basic tests
- Config: add macm extras to pyproject.toml
- Docs: update README, getting-started, architecture, and SKILL.md

Note: GPU utilization monitoring is not available on macOS due to
system API limitations. The busy_threshold parameter is accepted for
API compatibility but has no effect.

[controllers] feat: add Mac M series support
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 9, 2026

Warning

Rate limit exceeded

@Wangmerlyn has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 17 minutes and 52 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 348e624e-fb82-4b26-96ee-4c11b53d1642

📥 Commits

Reviewing files that changed from the base of the PR and between 8b2cf0a and 8a020ab.

📒 Files selected for processing (3)
  • docs/concepts/architecture.md
  • pyproject.toml
  • src/keep_gpu/global_gpu_controller/global_gpu_controller.py
📝 Walkthrough

Walkthrough

This PR adds macOS Apple Silicon (Mac M series) support as a new computing platform alongside CUDA and ROCm. It introduces MACM platform detection, a dedicated MacMGPUController for MPS-based memory management, integrates the new controller into platform selection logic, adds comprehensive documentation and guides, and establishes test infrastructure for Apple Silicon validation.

Changes

Cohort / File(s) Summary
Documentation
README.md, docs/getting-started.md, docs/concepts/architecture.md, skills/gpu-keepalive-with-keepgpu/SKILL.md
Added Mac M series (M1/M2/M3/M4) installation instructions, MPS backend usage notes, GPU utilization monitoring limitations, and troubleshooting guidance across multiple documentation sources. Updated platform architecture narrative to include ROCm and MACM alongside CUDA.
Platform Detection
src/keep_gpu/utilities/platform_manager.py
Introduced MACM computing platform enum value and _check_macm() detection function to identify Apple Silicon with MPS support. Extended platform detection order (CUDA → ROCm → MACM → CPU fallback).
GPU Controller Implementation
src/keep_gpu/single_gpu_controller/macm_gpu_controller.py
New MacMGPUController class extending BaseGPUController with background thread-based VRAM allocation for MPS. Includes device validation (rank=0 only), memory management via torch.mps.empty_cache(), and batch execution with retry logic.
Controller Integration
src/keep_gpu/global_gpu_controller/global_gpu_controller.py, src/keep_gpu/single_gpu_controller/__init__.py
Added MACM platform handling in GlobalGPUController.__init__ to select MacMGPUController and set gpu_ids to [0] for Mac M systems. Exported MacMGPUController from single_gpu_controller package.
Test Infrastructure
tests/conftest.py, tests/macm_controller/test_macm_basic.py
Added --run-macm pytest option and macm_available fixture for Apple Silicon detection. Implemented basic unit tests covering lifecycle, context manager usage, rank validation, and platform property verification. Updated test collection to conditionally skip macm tests.
Configuration
pyproject.toml
Added pytest marker macm: tests that require Apple Silicon with MPS to test configuration.

Sequence Diagram

```mermaid
sequenceDiagram
    participant App as Application
    participant GGC as GlobalGPUController
    participant PM as PlatformManager
    participant SGC as MacMGPUController

    App->>GGC: __init__(gpu_ids=None)
    GGC->>PM: detect_platform()
    PM->>PM: check_cuda()
    Note over PM: Not detected
    PM->>PM: check_rocm()
    Note over PM: Not detected
    PM->>PM: check_macm()
    Note over PM: MPS available on Apple Silicon
    PM-->>GGC: ComputingPlatform.MACM
    GGC->>GGC: controller_cls = MacMGPUController
    GGC->>GGC: gpu_ids = [0]
    GGC->>SGC: __init__(rank=0, vram_to_keep="1000 MB")
    SGC->>SGC: device="mps"<br/>platform=MACM
    SGC-->>GGC: controller initialized
    GGC-->>App: ready
```

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly Related PRs

  • PR #56 - Added ROCm GPU controller support with similar platform detection and controller selection patterns across GlobalGPUController and test gating infrastructure.
  • PR #63 - Modified GlobalGPUController.__init__ for per-GPU controller instantiation; this PR follows the same modification pattern for MACM integration.

Poem

🐰 A rabbit hops through Silicon valleys,
Where Metal shimmers bright on Apple's rallies,
VRAM keepalive dances—M-series so fleet,
From CUDA to ROCm, now MPS beats,
KeepGPU's threads leap, platforms compete! 🍎✨

🚥 Pre-merge checks: 2 passed, 1 warning

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 10.53%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title '[controllers] feat: add Apple Silicon MPS support' accurately summarizes the main change. |



@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the application's hardware compatibility by introducing comprehensive support for Apple Silicon Macs. It integrates the Metal Performance Shaders (MPS) backend, allowing users on M-series chips to leverage their GPU resources effectively. The changes encompass core controller implementation, platform detection, and extensive documentation updates, making the application accessible to a broader user base on macOS.

Highlights

  • Apple Silicon (Mac M series) Support: Introduced full support for Mac M series chips (M1/M2/M3/M4) using the Metal Performance Shaders (MPS) backend via a new MacMGPUController.
  • Platform Detection and Integration: Added MACM platform detection for macOS + arm64 + MPS availability, and integrated the MacMGPUController into the GlobalGPUController.
  • Documentation and Installation: Updated README.md, docs/getting-started.md, docs/concepts/architecture.md, and skills/SKILL.md with Mac M series installation instructions, limitations, and architectural details.
  • Testing Infrastructure: Added a macm extra to pyproject.toml, a --run-macm option to conftest.py, and new basic tests for the MacMGPUController.


Changelog
  • README.md
    • Added installation instructions for Mac M series.
    • Added a section detailing Mac M series limitations.
  • docs/concepts/architecture.md
    • Updated GlobalGPUController description to include Mac M series.
    • Expanded controller list to include MacMGPUController.
    • Updated GPU monitor section to note Mac M series lack of utilization monitoring.
    • Revised platform detection description to include Mac M series (MPS) path.
  • docs/getting-started.md
    • Updated platform information to include Mac M series support.
    • Added installation instructions for Mac M series.
    • Added instructions to verify MPS availability on Mac M series.
  • pyproject.toml
    • Added a macm marker for pytest.
  • skills/gpu-keepalive-with-keepgpu/SKILL.md
    • Added a new installation option for Mac M series.
    • Added notes and troubleshooting tips specific to Mac M users.
  • src/keep_gpu/global_gpu_controller/global_gpu_controller.py
    • Modified the __init__ method to select MacMGPUController for the MACM platform.
    • Adjusted GPU ID assignment to default to [0] for MACM platform.
  • src/keep_gpu/single_gpu_controller/__init__.py
    • Imported MacMGPUController and added it to __all__.
  • src/keep_gpu/single_gpu_controller/macm_gpu_controller.py
    • Added a new MacMGPUController class to manage GPU resources on Apple Silicon using torch.mps.
  • src/keep_gpu/utilities/platform_manager.py
    • Imported sys and platform modules.
    • Added MACM to the ComputingPlatform enum.
    • Implemented _check_macm function for detecting Apple Silicon with MPS support.
    • Included MACM in the _PLATFORM_CHECKS list.
    • Removed an empty line in the if __name__ == "__main__": block.
  • tests/conftest.py
    • Added --run-macm command-line option for pytest.
    • Modified pytest_collection_modifyitems to skip macm tests if --run-macm is not provided.
    • Added macm_available fixture to check for MPS availability.
  • tests/macm_controller/test_macm_basic.py
    • Added new basic tests for the MacMGPUController.
Activity
  • This is a new feature pull request, introducing Apple Silicon MPS support. No prior activity has been recorded.

@gemini-code-assist bot left a comment

Code Review

This pull request adds support for Apple Silicon (Mac M-series) chips using the MPS backend. A security review found no vulnerabilities. However, the review identified a few issues that need to be addressed before merging: the macm optional dependency is missing from pyproject.toml, a bug in MacMGPUController causes it to allocate four times the requested amount of VRAM, and there's an opportunity to refactor duplicated code in the test configuration. Once these are resolved, this great addition should be ready to merge.

```toml
[tool.pytest.ini_options]
markers = [
    "rocm: tests that require ROCm stack",
    "macm: tests that require Apple Silicon with MPS",
```

critical

The macm optional dependency is missing. The installation instructions in the documentation use pip install keep-gpu[macm], which will fail without this definition. According to the PR description, this should also include the psutil dependency.

Please add the macm extra to `[project.optional-dependencies]`. For example:

```toml
[project.optional-dependencies]
dev = [...]
rocm = [...]
macm = [
    "psutil",
]
```

```python
            logger.warning("rank %s: keep thread already running", self.rank)
            return

        self._num_elements = int(self.vram_to_keep)
```

high

There appears to be a miscalculation in the number of tensor elements. self.vram_to_keep holds the requested memory in bytes, but it's being used directly as the number of elements for a torch.float32 tensor. Since each float32 element occupies 4 bytes, this results in allocating 4 times the requested VRAM.

To allocate the correct amount of memory, you should divide the number of bytes by the size of the element type (4 for float32).

Suggested change:

```diff
-self._num_elements = int(self.vram_to_keep)
+self._num_elements = int(self.vram_to_keep) // 4
```
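The arithmetic behind the fix: a byte budget must be divided by the element size before it can be used as a tensor element count. A small helper (hypothetical name, shown only to make the conversion explicit):

```python
def bytes_to_elements(vram_bytes: int, dtype_size: int = 4) -> int:
    """Convert a byte budget into a tensor element count.

    dtype_size=4 corresponds to torch.float32 (4 bytes per element).
    """
    return vram_bytes // dtype_size


# 1000 MB requested as bytes -> number of float32 elements to allocate
assert bytes_to_elements(1000 * 1024 * 1024) == 262_144_000
```

Without the division, a 1000 MB request allocates 1000 M float32 elements, i.e. roughly 4 GB, which is the 4x overshoot the review flags.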

Comment on lines 19 to +30 of tests/conftest.py:

```diff
 def pytest_collection_modifyitems(config, items):
-    if config.getoption("--run-rocm"):
-        return
-    skip_rocm = pytest.mark.skip(reason="need --run-rocm option to run")
-    for item in items:
-        if "rocm" in item.keywords:
-            item.add_marker(skip_rocm)
+    if not config.getoption("--run-rocm"):
+        skip_rocm = pytest.mark.skip(reason="need --run-rocm option to run")
+        for item in items:
+            if "rocm" in item.keywords:
+                item.add_marker(skip_rocm)
+    if not config.getoption("--run-macm"):
+        skip_macm = pytest.mark.skip(reason="need --run-macm option to run")
+        for item in items:
+            if "macm" in item.keywords:
+                item.add_marker(skip_macm)
```

medium

The logic for skipping tests based on markers is duplicated for rocm and macm. You could refactor this into a loop to make it more maintainable and easier to add new markers in the future.

Suggested change:

```python
def pytest_collection_modifyitems(config, items):
    markers_to_skip = {
        "rocm": "--run-rocm",
        "macm": "--run-macm",
    }
    for marker_name, option_name in markers_to_skip.items():
        if not config.getoption(option_name):
            skip_marker = pytest.mark.skip(reason=f"need {option_name} option to run")
            for item in items:
                if marker_name in item.keywords:
                    item.add_marker(skip_marker)
```

@chatgpt-codex-connector bot left a comment
💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8b2cf0acf9


Comment on lines +51 to +52:

```python
        if self.computing_platform == ComputingPlatform.MACM:
            self.gpu_ids = [0]
```


P2: Honor caller-supplied GPU IDs on MACM

This branch unconditionally rewrites gpu_ids to [0] on MACM, so explicit inputs like gpu_ids=[] or gpu_ids=[3] are silently ignored and still start work on device 0. That makes API/CLI behavior inconsistent with CUDA/ROCm and can unexpectedly occupy the GPU when callers expected validation or no-op behavior; preserve None -> [0] but validate/reject incompatible explicit IDs instead of overriding them.


- **Mac M series (M1/M2/M3/M4)**

```bash
pip install torch
pip install keep-gpu[macm]
```


P2: Remove unsupported macm extra from install command

The new installation snippet tells users to run pip install keep-gpu[macm], but this repository’s pyproject.toml only defines dev and rocm extras, so macm is not a real extra. Users following this path get an unknown-extra install warning and no guaranteed Mac-specific dependency set, which makes onboarding and support guidance unreliable.


@coderabbitai bot left a comment

🧹 Nitpick comments (4)
src/keep_gpu/global_gpu_controller/global_gpu_controller.py (1)

51-56: Consider warning when user-provided gpu_ids are overridden.

When computing_platform == MACM, user-provided gpu_ids are silently replaced with [0]. This could surprise users who explicitly pass gpu_ids=[1] or expect multi-GPU behavior. The MCP server also passes through gpu_ids without platform awareness (see src/keep_gpu/mcp/server.py lines 89-117).

Consider logging a warning when non-default gpu_ids are overridden:

💡 Proposed enhancement:

```diff
         if self.computing_platform == ComputingPlatform.MACM:
+            if gpu_ids is not None and gpu_ids != [0]:
+                import logging
+                logging.getLogger(__name__).warning(
+                    "MPS only supports device 0; ignoring gpu_ids=%s", gpu_ids
+                )
             self.gpu_ids = [0]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/keep_gpu/global_gpu_controller/global_gpu_controller.py` around lines 51
- 56, In GlobalGPUController (the block that checks computing_platform ==
ComputingPlatform.MACM and assigns self.gpu_ids = [0]), detect when an explicit
gpu_ids was provided (gpu_ids is not None) and log a warning that the
user-provided gpu_ids are being overridden due to MACM platform constraints;
include both the original gpu_ids value and the enforced [0] in the warning
message and use the existing logger (or Python logging) so callers (and the MCP
server path) can see the override.
docs/getting-started.md (1)

86-87: Consider adding Mac-specific troubleshooting guidance.

The troubleshooting section only references CUDA errors and the benchmark tool. Mac M series users encountering MPS issues may not find this helpful. Consider adding a brief note for MPS troubleshooting or clarifying that the benchmark is CUDA-specific.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/getting-started.md` around lines 86 - 87, Add a short Mac-specific
troubleshooting note to the existing CUDA benchmark guidance: clarify that
`python -m keep_gpu.benchmark` is CUDA-specific, and add a line addressing Apple
Silicon/MPS users (mention MPS-related failures, suggest toggling MPS or
checking PyTorch MPS backend, macOS and Xcode/toolchain updates, and refer users
to MPS docs or a troubleshooting link). Include this alongside the existing CUDA
sentence so MPS users know the benchmark won't apply and what to try instead.
src/keep_gpu/single_gpu_controller/macm_gpu_controller.py (2)

122-127: Use logger.exception to capture full traceback on allocation failure.

Per static analysis hint TRY400, logger.exception should be used instead of logger.error when logging within an exception handler to include the traceback.

♻️ Proposed fix:

```diff
             except RuntimeError as exc:
-                logger.error("rank %s: failed to allocate tensor: %s", self.rank, exc)
+                logger.exception("rank %s: failed to allocate tensor", self.rank)
                 torch.mps.empty_cache()
                 gc.collect()
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/keep_gpu/single_gpu_controller/macm_gpu_controller.py` around lines 122 -
127, Replace the logger.error call in the except RuntimeError handler with
logger.exception so the full traceback is captured; in the exception block
inside the method in macm_gpu_controller.py where RuntimeError is caught (the
block that currently calls logger.error("rank %s: failed to allocate tensor:
%s", self.rank, exc)), change to logger.exception with the same contextual
message (including self.rank) and then keep the existing cleanup calls
torch.mps.empty_cache() and gc.collect() and the stop_evt.wait(self.interval)
return logic unchanged.

138-143: Log OOM errors for observability.

When an out-of-memory condition is caught, the error is silently handled without logging. This makes it harder to diagnose memory pressure issues. Consider logging a warning when OOM occurs.

♻️ Proposed fix:

```diff
             except RuntimeError as exc:
                 if "out of memory" in str(exc).lower():
+                    logger.warning("rank %s: MPS OOM, clearing cache", self.rank)
                     torch.mps.empty_cache()
                     gc.collect()
+                else:
+                    logger.exception("rank %s: runtime error in keep loop", self.rank)
                 if stop_evt.wait(self.interval):
                     break
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/keep_gpu/single_gpu_controller/macm_gpu_controller.py` around lines 138 -
143, The except block catching RuntimeError in the polling loop (the clause
"except RuntimeError as exc") swallows OOMs without logging; update it to log a
warning when "out of memory" is detected by calling the project's logger (or
processLogger) with the exception details before running torch.mps.empty_cache()
and gc.collect(), e.g., emit a message that includes exc and context (e.g.,
function/class macm_gpu_controller polling loop), and keep the existing
stop_evt.wait(self.interval) behavior unchanged so observability is improved
without changing control flow.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 757016a6-c92d-4fdb-8aec-2851953721e0

📥 Commits

Reviewing files that changed from the base of the PR and between 135bc7c and 8b2cf0a.

📒 Files selected for processing (12)
  • README.md
  • docs/concepts/architecture.md
  • docs/getting-started.md
  • pyproject.toml
  • skills/gpu-keepalive-with-keepgpu/SKILL.md
  • src/keep_gpu/global_gpu_controller/global_gpu_controller.py
  • src/keep_gpu/single_gpu_controller/__init__.py
  • src/keep_gpu/single_gpu_controller/macm_gpu_controller.py
  • src/keep_gpu/utilities/platform_manager.py
  • tests/conftest.py
  • tests/macm_controller/__init__.py
  • tests/macm_controller/test_macm_basic.py

- Add macm extras to pyproject.toml (was missing)
- Validate gpu_ids in GlobalGPUController for MACM platform
  - Accept None -> [0] or [0]
  - Raise ValueError for other gpu_ids values
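The validation described in the follow-up commit can be sketched as a small helper. `resolve_macm_gpu_ids` is a hypothetical name; per the commit notes, the actual change lives in `GlobalGPUController.__init__`.

```python
from typing import List, Optional


def resolve_macm_gpu_ids(gpu_ids: Optional[List[int]]) -> List[int]:
    """Validate gpu_ids for the MACM platform, where MPS exposes only device 0.

    None (auto-detect) and an explicit [0] both resolve to [0];
    anything else is rejected instead of being silently overridden.
    """
    if gpu_ids is None or gpu_ids == [0]:
        return [0]
    raise ValueError(f"MPS supports only device 0; got gpu_ids={gpu_ids!r}")
```

Raising instead of overriding addresses the Codex P2 finding: callers passing `gpu_ids=[3]` get an error rather than silently running on device 0.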
@Wangmerlyn Wangmerlyn merged commit 093e946 into main Mar 9, 2026
5 checks passed
