[controllers] feat: add Apple Silicon MPS support #70
Conversation
Add full support for Mac M series chips (M1/M2/M3/M4) using Metal Performance Shaders (MPS):

- MacMGPUController: new controller using the torch.mps backend
- Platform detection: add MACM platform to platform_manager
- GlobalGPUController: integrate MacMGPUController for the MACM platform
- GPU info: add _query_macm() for Apple Silicon GPU info
- Tests: add tests/macm_controller/ with basic tests
- Config: add macm extras to pyproject.toml
- Docs: update README, getting-started, architecture, and SKILL.md

Note: GPU utilization monitoring is not available on macOS due to system API limitations. The busy_threshold parameter is accepted for API compatibility but has no effect.
📝 Walkthrough

This PR adds macOS Apple Silicon (Mac M series) support as a new computing platform alongside CUDA and ROCm. It introduces MACM platform detection, a dedicated MacMGPUController for MPS-based memory management, integrates the new controller into platform selection logic, adds comprehensive documentation and guides, and establishes test infrastructure for Apple Silicon validation.
Sequence Diagram

```mermaid
sequenceDiagram
    participant App as Application
    participant GGC as GlobalGPUController
    participant PM as PlatformManager
    participant SGC as MacMGPUController
    App->>GGC: __init__(gpu_ids=None)
    GGC->>PM: detect_platform()
    PM->>PM: check_cuda()
    Note over PM: Not detected
    PM->>PM: check_rocm()
    Note over PM: Not detected
    PM->>PM: check_macm()
    Note over PM: MPS available on Apple Silicon
    PM-->>GGC: ComputingPlatform.MACM
    GGC->>GGC: controller_cls = MacMGPUController
    GGC->>GGC: gpu_ids = [0]
    GGC->>SGC: __init__(rank=0, vram_to_keep="1000 MB")
    SGC->>SGC: device="mps"<br/>platform=MACM
    SGC-->>GGC: controller initialized
    GGC-->>App: ready
```
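The detection order in the diagram — CUDA first, then ROCm, then MPS — can be sketched as a pure function over backend probes. This is a minimal illustration, not the actual `platform_manager` code: the `check_*` callables are stand-ins for the real probes (on a Mac, the MPS probe would typically be `torch.backends.mps.is_available()`), and the CPU fallback value is assumed here for completeness.

```python
from enum import Enum


class ComputingPlatform(Enum):
    CUDA = "cuda"
    ROCM = "rocm"
    MACM = "macm"
    CPU = "cpu"  # assumed fallback, for illustration only


def detect_platform(check_cuda, check_rocm, check_macm) -> ComputingPlatform:
    """Probe backends in priority order; the first available platform wins."""
    if check_cuda():
        return ComputingPlatform.CUDA
    if check_rocm():
        return ComputingPlatform.ROCM
    if check_macm():
        return ComputingPlatform.MACM
    return ComputingPlatform.CPU


# On an M-series Mac with no CUDA/ROCm, only the MPS probe succeeds:
platform = detect_platform(lambda: False, lambda: False, lambda: True)
```

Because the probes are passed in as callables, the priority order is easy to test without any GPU hardware present.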
Estimated Code Review Effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Summary of Changes (Gemini Code Assist)

This pull request significantly expands the application's hardware compatibility by introducing support for Apple Silicon Macs. It integrates the Metal Performance Shaders (MPS) backend, allowing users on M-series chips to leverage their GPU resources effectively. The changes encompass core controller implementation, platform detection, and extensive documentation updates, making the application accessible to a broader user base on macOS.
Code Review
This pull request adds support for Apple Silicon (Mac M-series) chips using the MPS backend. A security review found no vulnerabilities. However, the review identified a few issues that need to be addressed before merging: the macm optional dependency is missing from pyproject.toml, a bug in MacMGPUController causes it to allocate four times the requested amount of VRAM, and there's an opportunity to refactor duplicated code in the test configuration. Once these are resolved, this great addition should be ready to merge.
```toml
[tool.pytest.ini_options]
markers = [
    "rocm: tests that require ROCm stack",
    "macm: tests that require Apple Silicon with MPS",
]
```
The macm optional dependency is missing. The installation instructions in the documentation use pip install keep-gpu[macm], which will fail without this definition. According to the PR description, this should also include the psutil dependency.
Please add the macm extra to [project.optional-dependencies]. For example:
```toml
[project.optional-dependencies]
dev = [...]
rocm = [...]
macm = [
    "psutil",
]
```

```python
logger.warning("rank %s: keep thread already running", self.rank)
return

self._num_elements = int(self.vram_to_keep)
```
There appears to be a miscalculation in the number of tensor elements. self.vram_to_keep holds the requested memory in bytes, but it's being used directly as the number of elements for a torch.float32 tensor. Since each float32 element occupies 4 bytes, this results in allocating 4 times the requested VRAM.
To allocate the correct amount of memory, you should divide the number of bytes by the size of the element type (4 for float32).
```diff
- self._num_elements = int(self.vram_to_keep)
+ self._num_elements = int(self.vram_to_keep) // 4
```
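The arithmetic behind this fix is worth spelling out: a float32 element occupies 4 bytes, so a byte budget must be divided by 4 to get an element count. The helper below is a hypothetical illustration (not a function in the PR) of that conversion:

```python
FLOAT32_BYTES = 4  # each torch.float32 element occupies 4 bytes


def num_elements_for(vram_bytes: int, element_bytes: int = FLOAT32_BYTES) -> int:
    """Convert a requested VRAM budget in bytes to an element count."""
    return int(vram_bytes) // element_bytes


# Requesting "1000 MB" (1000 * 1024 * 1024 bytes) yields 262_144_000 float32
# elements; re-multiplying by 4 lands exactly back on the byte budget.
budget = 1000 * 1024 * 1024
elements = num_elements_for(budget)
```

Without the division, the tensor would hold one element per *byte* requested, i.e. four times the intended allocation — exactly the bug the reviewer describes.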
```diff
 def pytest_collection_modifyitems(config, items):
-    if config.getoption("--run-rocm"):
-        return
-    skip_rocm = pytest.mark.skip(reason="need --run-rocm option to run")
-    for item in items:
-        if "rocm" in item.keywords:
-            item.add_marker(skip_rocm)
+    if not config.getoption("--run-rocm"):
+        skip_rocm = pytest.mark.skip(reason="need --run-rocm option to run")
+        for item in items:
+            if "rocm" in item.keywords:
+                item.add_marker(skip_rocm)
+    if not config.getoption("--run-macm"):
+        skip_macm = pytest.mark.skip(reason="need --run-macm option to run")
+        for item in items:
+            if "macm" in item.keywords:
+                item.add_marker(skip_macm)
```
The logic for skipping tests based on markers is duplicated for rocm and macm. You could refactor this into a loop to make it more maintainable and easier to add new markers in the future.
```diff
 def pytest_collection_modifyitems(config, items):
-    if not config.getoption("--run-rocm"):
-        skip_rocm = pytest.mark.skip(reason="need --run-rocm option to run")
-        for item in items:
-            if "rocm" in item.keywords:
-                item.add_marker(skip_rocm)
-    if not config.getoption("--run-macm"):
-        skip_macm = pytest.mark.skip(reason="need --run-macm option to run")
-        for item in items:
-            if "macm" in item.keywords:
-                item.add_marker(skip_macm)
+    markers_to_skip = {
+        "rocm": "--run-rocm",
+        "macm": "--run-macm",
+    }
+    for marker_name, option_name in markers_to_skip.items():
+        if not config.getoption(option_name):
+            skip_marker = pytest.mark.skip(reason=f"need {option_name} option to run")
+            for item in items:
+                if marker_name in item.keywords:
+                    item.add_marker(skip_marker)
```
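Both variants of the skip logic assume the `--run-rocm` and `--run-macm` flags are registered somewhere in `tests/conftest.py`. A `pytest_addoption` hook along the following lines would do that; the help strings here are illustrative, and only the flag names are taken from the snippets above:

```python
def pytest_addoption(parser):
    """Register the opt-in flags that gate hardware-specific test suites."""
    parser.addoption(
        "--run-rocm",
        action="store_true",
        default=False,
        help="run tests marked with @pytest.mark.rocm",
    )
    parser.addoption(
        "--run-macm",
        action="store_true",
        default=False,
        help="run tests marked with @pytest.mark.macm",
    )
```

With `store_true` and a `False` default, both suites stay skipped unless the user opts in explicitly, which is the behavior the skip loop relies on.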
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8b2cf0acf9
```python
if self.computing_platform == ComputingPlatform.MACM:
    self.gpu_ids = [0]
```
Honor caller-supplied GPU IDs on MACM

This branch unconditionally rewrites gpu_ids to [0] on MACM, so explicit inputs like gpu_ids=[] or gpu_ids=[3] are silently ignored and work still starts on device 0. That makes API/CLI behavior inconsistent with CUDA/ROCm and can unexpectedly occupy the GPU when callers expected validation or no-op behavior. Preserve None -> [0], but validate and reject incompatible explicit IDs instead of overriding them.
- **Mac M series (M1/M2/M3/M4)**

  ```bash
  pip install torch
  pip install keep-gpu[macm]
  ```
Remove unsupported macm extra from install command
The new installation snippet tells users to run pip install keep-gpu[macm], but this repository’s pyproject.toml only defines dev and rocm extras, so macm is not a real extra. Users following this path get an unknown-extra install warning and no guaranteed Mac-specific dependency set, which makes onboarding and support guidance unreliable.
🧹 Nitpick comments (4)
src/keep_gpu/global_gpu_controller/global_gpu_controller.py (1)

**51-56: Consider warning when user-provided `gpu_ids` are overridden.**

When `computing_platform == MACM`, user-provided `gpu_ids` are silently replaced with `[0]`. This could surprise users who explicitly pass `gpu_ids=[1]` or expect multi-GPU behavior. The MCP server also passes through `gpu_ids` without platform awareness (see `src/keep_gpu/mcp/server.py` lines 89-117). Consider logging a warning when non-default `gpu_ids` are overridden:

💡 Proposed enhancement

```diff
 if self.computing_platform == ComputingPlatform.MACM:
+    if gpu_ids is not None and gpu_ids != [0]:
+        import logging
+        logging.getLogger(__name__).warning(
+            "MPS only supports device 0; ignoring gpu_ids=%s", gpu_ids
+        )
     self.gpu_ids = [0]
```

docs/getting-started.md (1)

**86-87: Consider adding Mac-specific troubleshooting guidance.**

The troubleshooting section only references CUDA errors and the benchmark tool. Mac M series users encountering MPS issues may not find this helpful. Consider adding a brief note for MPS troubleshooting or clarifying that the benchmark is CUDA-specific.

src/keep_gpu/single_gpu_controller/macm_gpu_controller.py (2)

**122-127: Use `logger.exception` to capture the full traceback on allocation failure.**

Per static analysis hint TRY400, `logger.exception` should be used instead of `logger.error` when logging within an exception handler, so the traceback is included.

♻️ Proposed fix

```diff
 except RuntimeError as exc:
-    logger.error("rank %s: failed to allocate tensor: %s", self.rank, exc)
+    logger.exception("rank %s: failed to allocate tensor", self.rank)
     torch.mps.empty_cache()
     gc.collect()
```

**138-143: Log OOM errors for observability.**

When an out-of-memory condition is caught, the error is silently handled without logging. This makes it harder to diagnose memory pressure issues. Consider logging a warning when OOM occurs.

♻️ Proposed fix

```diff
 except RuntimeError as exc:
     if "out of memory" in str(exc).lower():
+        logger.warning("rank %s: MPS OOM, clearing cache", self.rank)
         torch.mps.empty_cache()
         gc.collect()
+    else:
+        logger.exception("rank %s: runtime error in keep loop", self.rank)
     if stop_evt.wait(self.interval):
         break
```
ℹ️ Review info

⚙️ Run configuration: defaults · Review profile: CHILL · Plan: Pro · Run ID: 757016a6-c92d-4fdb-8aec-2851953721e0
📒 Files selected for processing (12)

- README.md
- docs/concepts/architecture.md
- docs/getting-started.md
- pyproject.toml
- skills/gpu-keepalive-with-keepgpu/SKILL.md
- src/keep_gpu/global_gpu_controller/global_gpu_controller.py
- src/keep_gpu/single_gpu_controller/__init__.py
- src/keep_gpu/single_gpu_controller/macm_gpu_controller.py
- src/keep_gpu/utilities/platform_manager.py
- tests/conftest.py
- tests/macm_controller/__init__.py
- tests/macm_controller/test_macm_basic.py
- Add macm extras to pyproject.toml (was missing)
- Validate gpu_ids in GlobalGPUController for the MACM platform
  - Accept None -> [0] or [0]
  - Raise ValueError for other gpu_ids values
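The validation rule described above (None maps to [0], an explicit [0] is accepted, anything else raises ValueError) can be sketched as a small helper. The function name is hypothetical — it is not the actual method in GlobalGPUController — but the accept/reject behavior matches the commit description:

```python
from typing import List, Optional


def resolve_macm_gpu_ids(gpu_ids: Optional[List[int]]) -> List[int]:
    """Resolve gpu_ids for the MACM platform, where MPS exposes a single device 0."""
    if gpu_ids is None:
        return [0]  # default: keep the only MPS device alive
    if gpu_ids == [0]:
        return gpu_ids  # explicit request for device 0 is fine
    raise ValueError(
        f"MACM platform only supports gpu_ids=[0] or None, got {gpu_ids!r}"
    )
```

Raising instead of silently overriding addresses the earlier Codex finding: callers who pass `gpu_ids=[3]` (or an empty list) get an immediate error rather than unexpected work on device 0.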
Summary
Add full support for Mac M series chips (M1/M2/M3/M4) using Metal Performance Shaders (MPS).
Known Limitations

GPU utilization monitoring is not available on macOS due to system API limitations. The busy_threshold parameter is accepted for API compatibility but has no effect.
Testing

On an Apple Silicon Mac with PyTorch MPS support:

```bash
pip install -e .[macm]
pytest tests/macm_controller/ --run-macm -v
```