
[controllers] feat: add Apple Silicon MPS support #70

Merged
Wangmerlyn merged 3 commits into main from feature/macm-support on Mar 9, 2026

Conversation


@Wangmerlyn (Owner) commented Mar 9, 2026

Summary

Add full support for Mac M series chips (M1/M2/M3/M4) using Metal Performance Shaders (MPS).

Changes

Core Implementation

  • MacMGPUController: New controller using torch.mps backend
  • Supports MPS device 0 only (Apple Silicon limitation)
  • Uses torch.mps.synchronize() and torch.mps.empty_cache() for memory management
  • busy_threshold parameter accepted for API compatibility but has no effect
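The controller's background keep-alive pattern can be sketched in pure Python. This is a simplified illustration, not the project's actual API: `KeepLoop` and its `allocate()` placeholder are hypothetical names, and the real controller would allocate an MPS tensor (e.g. via `torch.ones(n, device="mps")`) where `allocate()` is called.

```python
import threading
from typing import Optional


class KeepLoop:
    """Skeleton of a background keep-alive loop (illustrative only)."""

    def __init__(self, interval: float = 0.1) -> None:
        self.interval = interval
        self._stop_evt = threading.Event()
        self._thread: Optional[threading.Thread] = None
        self.ticks = 0  # counts loop iterations, for illustration only

    def allocate(self) -> None:
        # Placeholder for the real MPS tensor allocation; a no-op here.
        self.ticks += 1

    def _run(self) -> None:
        # Wake every `interval` seconds until stop() is called.
        while not self._stop_evt.wait(self.interval):
            self.allocate()

    def start(self) -> None:
        if self._thread is not None and self._thread.is_alive():
            return  # already running; mirrors the controller's early return
        self._stop_evt.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self) -> None:
        self._stop_evt.set()
        if self._thread is not None:
            self._thread.join()
```

The `threading.Event`-based stop signal is what lets `stop()` interrupt the loop mid-sleep instead of waiting out a full interval.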

Platform Support

  • Platform Detection: Add MACM platform detection
  • Detects macOS + arm64 + MPS availability
  • Detection order: CUDA → ROCm → Mac M → CPU
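The detection step can be sketched as below. Function names here are illustrative (the PR's actual helper is `_check_macm()` in `platform_manager.py`); `torch.backends.mps.is_available()` is the real PyTorch probe.

```python
import platform


def is_apple_silicon(system: str, machine: str) -> bool:
    """Pure platform check, kept separate from the torch probe for testability."""
    return system == "Darwin" and machine == "arm64"


def check_macm() -> bool:
    """Detect Apple Silicon with a usable MPS backend (hypothetical helper)."""
    if not is_apple_silicon(platform.system(), platform.machine()):
        return False
    try:
        import torch
    except ImportError:
        return False
    backend = getattr(torch.backends, "mps", None)
    return backend is not None and backend.is_available()
```

Running the pure check with injected values (e.g. `is_apple_silicon("Darwin", "arm64")`) lets the detection logic be unit-tested on any platform.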

Integration

  • GlobalGPUController: Integrate MacMGPUController for MACM platform
  • GPU Info: Add _query_macm() for Apple Silicon GPU info
  • Tests: Add tests/macm_controller/ with basic tests

Configuration

  • pyproject.toml: Add macm extras with psutil dependency
  • conftest.py: Add --run-macm option and macm_available fixture

Documentation

  • README.md: Add Mac M series installation instructions
  • docs/getting-started.md: Add Mac platform guide and MPS check
  • docs/concepts/architecture.md: Update with Mac M support
  • skills/SKILL.md: Add Mac M installation option and troubleshooting

Known Limitations

  • GPU Utilization Monitoring: Not available on macOS due to system API limitations. The busy_threshold parameter is accepted for API compatibility but has no effect.
  • Single Device: MPS only supports device 0.
  • Unified Memory: Apple Silicon uses unified memory architecture (shared with system RAM).

Testing

On an Apple Silicon Mac with PyTorch MPS support:

```bash
pip install -e ".[macm]"
pytest tests/macm_controller/ --run-macm -v
```

Installation

```bash
pip install torch
pip install "keep-gpu[macm]"
```
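Before installing, MPS availability can be confirmed from Python. This is a sketch: `mps_status` is a hypothetical helper wrapping the real `torch.backends.mps.is_available()` probe.

```python
def mps_status() -> str:
    """Report MPS backend availability; 'torch-missing' if torch is absent."""
    try:
        import torch
    except ImportError:
        return "torch-missing"
    # torch.backends.mps exists on PyTorch >= 1.12; guard for older builds.
    backend = getattr(torch.backends, "mps", None)
    if backend is None:
        return "unavailable"
    return "available" if backend.is_available() else "unavailable"


print("MPS:", mps_status())
```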

Summary by CodeRabbit

  • New Features

    • Added support for Apple Silicon Macs (M1/M2/M3/M4) using Metal Performance Shaders (MPS).
    • GPU memory management and keepalive functionality now available for macOS systems.
    • Platform detection automatically selects MPS backend on compatible devices.
  • Documentation

    • Expanded installation guides with Mac M series-specific setup instructions.
    • Added platform-specific limitations and requirements for Apple Silicon devices.

Add full support for Mac M series chips (M1/M2/M3/M4) using Metal Performance Shaders (MPS):

- MacMGPUController: new controller using torch.mps backend
- Platform detection: add MACM platform to platform_manager
- GlobalGPUController: integrate MacMGPUController for MACM platform
- GPU info: add _query_macm() for Apple Silicon GPU info
- Tests: add tests/macm_controller/ with basic tests
- Config: add macm extras to pyproject.toml
- Docs: update README, getting-started, architecture, and SKILL.md

Note: GPU utilization monitoring is not available on macOS due to
system API limitations. The busy_threshold parameter is accepted for
API compatibility but has no effect.

[controllers] feat: add Mac M series support
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 9, 2026

Warning

Rate limit exceeded

@Wangmerlyn has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 17 minutes and 52 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 348e624e-fb82-4b26-96ee-4c11b53d1642

📥 Commits

Reviewing files that changed from the base of the PR and between 8b2cf0a and 8a020ab.

📒 Files selected for processing (3)
  • docs/concepts/architecture.md
  • pyproject.toml
  • src/keep_gpu/global_gpu_controller/global_gpu_controller.py
📝 Walkthrough

Walkthrough

This PR adds macOS Apple Silicon (Mac M series) support as a new computing platform alongside CUDA and ROCm. It introduces MACM platform detection, a dedicated MacMGPUController for MPS-based memory management, integrates the new controller into platform selection logic, adds comprehensive documentation and guides, and establishes test infrastructure for Apple Silicon validation.

Changes

Cohort / File(s) Summary
Documentation
README.md, docs/getting-started.md, docs/concepts/architecture.md, skills/gpu-keepalive-with-keepgpu/SKILL.md
Added Mac M series (M1/M2/M3/M4) installation instructions, MPS backend usage notes, GPU utilization monitoring limitations, and troubleshooting guidance across multiple documentation sources. Updated platform architecture narrative to include ROCm and MACM alongside CUDA.
Platform Detection
src/keep_gpu/utilities/platform_manager.py
Introduced MACM computing platform enum value and _check_macm() detection function to identify Apple Silicon with MPS support. Extended platform detection order (CUDA → ROCm → MACM → CPU fallback).
GPU Controller Implementation
src/keep_gpu/single_gpu_controller/macm_gpu_controller.py
New MacMGPUController class extending BaseGPUController with background thread-based VRAM allocation for MPS. Includes device validation (rank=0 only), memory management via torch.mps.empty_cache(), and batch execution with retry logic.
Controller Integration
src/keep_gpu/global_gpu_controller/global_gpu_controller.py, src/keep_gpu/single_gpu_controller/__init__.py
Added MACM platform handling in GlobalGPUController.__init__ to select MacMGPUController and set gpu_ids to [0] for Mac M systems. Exported MacMGPUController from single_gpu_controller package.
Test Infrastructure
tests/conftest.py, tests/macm_controller/test_macm_basic.py
Added --run-macm pytest option and macm_available fixture for Apple Silicon detection. Implemented basic unit tests covering lifecycle, context manager usage, rank validation, and platform property verification. Updated test collection to conditionally skip macm tests.
Configuration
pyproject.toml
Added pytest marker macm: tests that require Apple Silicon with MPS to test configuration.

Sequence Diagram

```mermaid
sequenceDiagram
    participant App as Application
    participant GGC as GlobalGPUController
    participant PM as PlatformManager
    participant SGC as MacMGPUController

    App->>GGC: __init__(gpu_ids=None)
    GGC->>PM: detect_platform()
    PM->>PM: check_cuda()
    Note over PM: Not detected
    PM->>PM: check_rocm()
    Note over PM: Not detected
    PM->>PM: check_macm()
    Note over PM: MPS available on Apple Silicon
    PM-->>GGC: ComputingPlatform.MACM
    GGC->>GGC: controller_cls = MacMGPUController
    GGC->>GGC: gpu_ids = [0]
    GGC->>SGC: __init__(rank=0, vram_to_keep="1000 MB")
    SGC->>SGC: device="mps"<br/>platform=MACM
    SGC-->>GGC: controller initialized
    GGC-->>App: ready
```

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly Related PRs

  • PR #56 - Added ROCm GPU controller support with similar platform detection and controller selection patterns across GlobalGPUController and test gating infrastructure.
  • PR #63 - Modified GlobalGPUController.__init__ for per-GPU controller instantiation; this PR follows the same modification pattern for MACM integration.

Poem

🐰 A rabbit hops through Silicon valleys,
Where Metal shimmers bright on Apple's rallies,
VRAM keepalive dances—M-series so fleet,
From CUDA to ROCm, now MPS beats,
KeepGPU's threads leap, platforms compete! 🍎✨

🚥 Pre-merge checks: 2 passed, 1 warning

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 10.53%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title '[controllers] feat: add Apple Silicon MPS support' accurately summarizes the main change. |



@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the application's hardware compatibility by introducing comprehensive support for Apple Silicon Macs. It integrates the Metal Performance Shaders (MPS) backend, allowing users on M-series chips to leverage their GPU resources effectively. The changes encompass core controller implementation, platform detection, and extensive documentation updates, making the application accessible to a broader user base on macOS.

Highlights

  • Apple Silicon (Mac M series) Support: Introduced full support for Mac M series chips (M1/M2/M3/M4) using the Metal Performance Shaders (MPS) backend via a new MacMGPUController.
  • Platform Detection and Integration: Added MACM platform detection for macOS + arm64 + MPS availability, and integrated the MacMGPUController into the GlobalGPUController.
  • Documentation and Installation: Updated README.md, docs/getting-started.md, docs/concepts/architecture.md, and skills/SKILL.md with Mac M series installation instructions, limitations, and architectural details.
  • Testing Infrastructure: Added a macm extra to pyproject.toml, a --run-macm option to conftest.py, and new basic tests for the MacMGPUController.


Changelog
  • README.md
    • Added installation instructions for Mac M series.
    • Added a section detailing Mac M series limitations.
  • docs/concepts/architecture.md
    • Updated GlobalGPUController description to include Mac M series.
    • Expanded controller list to include MacMGPUController.
    • Updated GPU monitor section to note Mac M series lack of utilization monitoring.
    • Revised platform detection description to include Mac M series (MPS) path.
  • docs/getting-started.md
    • Updated platform information to include Mac M series support.
    • Added installation instructions for Mac M series.
    • Added instructions to verify MPS availability on Mac M series.
  • pyproject.toml
    • Added a macm marker for pytest.
  • skills/gpu-keepalive-with-keepgpu/SKILL.md
    • Added a new installation option for Mac M series.
    • Added notes and troubleshooting tips specific to Mac M users.
  • src/keep_gpu/global_gpu_controller/global_gpu_controller.py
    • Modified the __init__ method to select MacMGPUController for the MACM platform.
    • Adjusted GPU ID assignment to default to [0] for MACM platform.
  • src/keep_gpu/single_gpu_controller/__init__.py
    • Imported MacMGPUController and added it to __all__.
  • src/keep_gpu/single_gpu_controller/macm_gpu_controller.py
    • Added a new MacMGPUController class to manage GPU resources on Apple Silicon using torch.mps.
  • src/keep_gpu/utilities/platform_manager.py
    • Imported sys and platform modules.
    • Added MACM to the ComputingPlatform enum.
    • Implemented _check_macm function for detecting Apple Silicon with MPS support.
    • Included MACM in the _PLATFORM_CHECKS list.
    • Removed an empty line in the if __name__ == "__main__": block.
  • tests/conftest.py
    • Added --run-macm command-line option for pytest.
    • Modified pytest_collection_modifyitems to skip macm tests if --run-macm is not provided.
    • Added macm_available fixture to check for MPS availability.
  • tests/macm_controller/test_macm_basic.py
    • Added new basic tests for the MacMGPUController.
Activity
  • This is a new feature pull request, introducing Apple Silicon MPS support. No prior activity has been recorded.

@gemini-code-assist bot left a comment

Code Review

This pull request adds support for Apple Silicon (Mac M-series) chips using the MPS backend. A security review found no vulnerabilities. However, the review identified a few issues that need to be addressed before merging: the macm optional dependency is missing from pyproject.toml, a bug in MacMGPUController causes it to allocate four times the requested amount of VRAM, and there's an opportunity to refactor duplicated code in the test configuration. Once these are resolved, this great addition should be ready to merge.

```toml
[tool.pytest.ini_options]
markers = [
    "rocm: tests that require ROCm stack",
    "macm: tests that require Apple Silicon with MPS",
```

critical

The macm optional dependency is missing. The installation instructions in the documentation use pip install keep-gpu[macm], which will fail without this definition. According to the PR description, this should also include the psutil dependency.

Please add the macm extra to `[project.optional-dependencies]`. For example:

```toml
[project.optional-dependencies]
dev = [...]
rocm = [...]
macm = [
    "psutil",
]
```

```python
            logger.warning("rank %s: keep thread already running", self.rank)
            return

        self._num_elements = int(self.vram_to_keep)
```

high

There appears to be a miscalculation in the number of tensor elements. self.vram_to_keep holds the requested memory in bytes, but it's being used directly as the number of elements for a torch.float32 tensor. Since each float32 element occupies 4 bytes, this results in allocating 4 times the requested VRAM.

To allocate the correct amount of memory, you should divide the number of bytes by the size of the element type (4 for float32).

Suggested change:

```diff
-self._num_elements = int(self.vram_to_keep)
+self._num_elements = int(self.vram_to_keep) // 4
```
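The arithmetic behind the fix: a byte budget must be divided by the element size before it can be used as a tensor element count. A small helper (hypothetical name, shown only to make the conversion explicit):

```python
def bytes_to_elements(vram_bytes: int, dtype_size: int = 4) -> int:
    """Convert a byte budget into a tensor element count.

    dtype_size=4 corresponds to torch.float32 (4 bytes per element).
    """
    return vram_bytes // dtype_size


# 1000 MB requested as bytes -> number of float32 elements to allocate
assert bytes_to_elements(1000 * 1024 * 1024) == 262_144_000
```

Without the division, a 1000 MB request allocates 1000 M float32 elements, i.e. roughly 4 GB, which is the 4x overshoot the review flags.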

Comment on lines 19 to +30 of tests/conftest.py:

```diff
 def pytest_collection_modifyitems(config, items):
-    if config.getoption("--run-rocm"):
-        return
-    skip_rocm = pytest.mark.skip(reason="need --run-rocm option to run")
-    for item in items:
-        if "rocm" in item.keywords:
-            item.add_marker(skip_rocm)
+    if not config.getoption("--run-rocm"):
+        skip_rocm = pytest.mark.skip(reason="need --run-rocm option to run")
+        for item in items:
+            if "rocm" in item.keywords:
+                item.add_marker(skip_rocm)
+    if not config.getoption("--run-macm"):
+        skip_macm = pytest.mark.skip(reason="need --run-macm option to run")
+        for item in items:
+            if "macm" in item.keywords:
+                item.add_marker(skip_macm)
```

medium

The logic for skipping tests based on markers is duplicated for rocm and macm. You could refactor this into a loop to make it more maintainable and easier to add new markers in the future.

Suggested change:

```python
def pytest_collection_modifyitems(config, items):
    markers_to_skip = {
        "rocm": "--run-rocm",
        "macm": "--run-macm",
    }
    for marker_name, option_name in markers_to_skip.items():
        if not config.getoption(option_name):
            skip_marker = pytest.mark.skip(reason=f"need {option_name} option to run")
            for item in items:
                if marker_name in item.keywords:
                    item.add_marker(skip_marker)
```

@chatgpt-codex-connector bot left a comment
💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8b2cf0acf9


Comment on lines +51 to +52:

```python
        if self.computing_platform == ComputingPlatform.MACM:
            self.gpu_ids = [0]
```


P2: Honor caller-supplied GPU IDs on MACM

This branch unconditionally rewrites gpu_ids to [0] on MACM, so explicit inputs like gpu_ids=[] or gpu_ids=[3] are silently ignored and still start work on device 0. That makes API/CLI behavior inconsistent with CUDA/ROCm and can unexpectedly occupy the GPU when callers expected validation or no-op behavior; preserve None -> [0] but validate/reject incompatible explicit IDs instead of overriding them.


- **Mac M series (M1/M2/M3/M4)**

```bash
pip install torch
pip install keep-gpu[macm]
```


P2: Remove unsupported macm extra from install command

The new installation snippet tells users to run pip install keep-gpu[macm], but this repository’s pyproject.toml only defines dev and rocm extras, so macm is not a real extra. Users following this path get an unknown-extra install warning and no guaranteed Mac-specific dependency set, which makes onboarding and support guidance unreliable.


@coderabbitai bot left a comment

🧹 Nitpick comments (4)
src/keep_gpu/global_gpu_controller/global_gpu_controller.py (1)

51-56: Consider warning when user-provided gpu_ids are overridden.

When computing_platform == MACM, user-provided gpu_ids are silently replaced with [0]. This could surprise users who explicitly pass gpu_ids=[1] or expect multi-GPU behavior. The MCP server also passes through gpu_ids without platform awareness (see src/keep_gpu/mcp/server.py lines 89-117).

Consider logging a warning when non-default gpu_ids are overridden:

💡 Proposed enhancement:

```diff
         if self.computing_platform == ComputingPlatform.MACM:
+            if gpu_ids is not None and gpu_ids != [0]:
+                import logging
+                logging.getLogger(__name__).warning(
+                    "MPS only supports device 0; ignoring gpu_ids=%s", gpu_ids
+                )
             self.gpu_ids = [0]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/keep_gpu/global_gpu_controller/global_gpu_controller.py` around lines 51
- 56, In GlobalGPUController (the block that checks computing_platform ==
ComputingPlatform.MACM and assigns self.gpu_ids = [0]), detect when an explicit
gpu_ids was provided (gpu_ids is not None) and log a warning that the
user-provided gpu_ids are being overridden due to MACM platform constraints;
include both the original gpu_ids value and the enforced [0] in the warning
message and use the existing logger (or Python logging) so callers (and the MCP
server path) can see the override.
docs/getting-started.md (1)

86-87: Consider adding Mac-specific troubleshooting guidance.

The troubleshooting section only references CUDA errors and the benchmark tool. Mac M series users encountering MPS issues may not find this helpful. Consider adding a brief note for MPS troubleshooting or clarifying that the benchmark is CUDA-specific.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/getting-started.md` around lines 86 - 87, Add a short Mac-specific
troubleshooting note to the existing CUDA benchmark guidance: clarify that
`python -m keep_gpu.benchmark` is CUDA-specific, and add a line addressing Apple
Silicon/MPS users (mention MPS-related failures, suggest toggling MPS or
checking PyTorch MPS backend, macOS and Xcode/toolchain updates, and refer users
to MPS docs or a troubleshooting link). Include this alongside the existing CUDA
sentence so MPS users know the benchmark won't apply and what to try instead.
src/keep_gpu/single_gpu_controller/macm_gpu_controller.py (2)

122-127: Use logger.exception to capture full traceback on allocation failure.

Per static analysis hint TRY400, logger.exception should be used instead of logger.error when logging within an exception handler to include the traceback.

♻️ Proposed fix:

```diff
             except RuntimeError as exc:
-                logger.error("rank %s: failed to allocate tensor: %s", self.rank, exc)
+                logger.exception("rank %s: failed to allocate tensor", self.rank)
                 torch.mps.empty_cache()
                 gc.collect()
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/keep_gpu/single_gpu_controller/macm_gpu_controller.py` around lines 122 -
127, Replace the logger.error call in the except RuntimeError handler with
logger.exception so the full traceback is captured; in the exception block
inside the method in macm_gpu_controller.py where RuntimeError is caught (the
block that currently calls logger.error("rank %s: failed to allocate tensor:
%s", self.rank, exc)), change to logger.exception with the same contextual
message (including self.rank) and then keep the existing cleanup calls
torch.mps.empty_cache() and gc.collect() and the stop_evt.wait(self.interval)
return logic unchanged.

138-143: Log OOM errors for observability.

When an out-of-memory condition is caught, the error is silently handled without logging. This makes it harder to diagnose memory pressure issues. Consider logging a warning when OOM occurs.

♻️ Proposed fix:

```diff
             except RuntimeError as exc:
                 if "out of memory" in str(exc).lower():
+                    logger.warning("rank %s: MPS OOM, clearing cache", self.rank)
                     torch.mps.empty_cache()
                     gc.collect()
+                else:
+                    logger.exception("rank %s: runtime error in keep loop", self.rank)
                 if stop_evt.wait(self.interval):
                     break
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/keep_gpu/single_gpu_controller/macm_gpu_controller.py` around lines 138 -
143, The except block catching RuntimeError in the polling loop (the clause
"except RuntimeError as exc") swallows OOMs without logging; update it to log a
warning when "out of memory" is detected by calling the project's logger (or
processLogger) with the exception details before running torch.mps.empty_cache()
and gc.collect(), e.g., emit a message that includes exc and context (e.g.,
function/class macm_gpu_controller polling loop), and keep the existing
stop_evt.wait(self.interval) behavior unchanged so observability is improved
without changing control flow.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 757016a6-c92d-4fdb-8aec-2851953721e0

📥 Commits

Reviewing files that changed from the base of the PR and between 135bc7c and 8b2cf0a.

📒 Files selected for processing (12)
  • README.md
  • docs/concepts/architecture.md
  • docs/getting-started.md
  • pyproject.toml
  • skills/gpu-keepalive-with-keepgpu/SKILL.md
  • src/keep_gpu/global_gpu_controller/global_gpu_controller.py
  • src/keep_gpu/single_gpu_controller/__init__.py
  • src/keep_gpu/single_gpu_controller/macm_gpu_controller.py
  • src/keep_gpu/utilities/platform_manager.py
  • tests/conftest.py
  • tests/macm_controller/__init__.py
  • tests/macm_controller/test_macm_basic.py

- Add macm extras to pyproject.toml (was missing)
- Validate gpu_ids in GlobalGPUController for MACM platform
  - Accept None -> [0] or [0]
  - Raise ValueError for other gpu_ids values
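The validation described in the follow-up commit can be sketched as a small helper. `resolve_macm_gpu_ids` is a hypothetical name; per the commit notes, the actual change lives in `GlobalGPUController.__init__`.

```python
from typing import List, Optional


def resolve_macm_gpu_ids(gpu_ids: Optional[List[int]]) -> List[int]:
    """Validate gpu_ids for the MACM platform, where MPS exposes only device 0.

    None (auto-detect) and an explicit [0] both resolve to [0];
    anything else is rejected instead of being silently overridden.
    """
    if gpu_ids is None or gpu_ids == [0]:
        return [0]
    raise ValueError(f"MPS supports only device 0; got gpu_ids={gpu_ids!r}")
```

Raising instead of overriding addresses the Codex P2 finding: callers passing `gpu_ids=[3]` get an error rather than silently running on device 0.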
@Wangmerlyn Wangmerlyn merged commit 093e946 into main Mar 9, 2026
5 checks passed
