Metal: correctness-gate M5 Max 4096 prefill (+5%) #149
Open
fitchmultz wants to merge 3 commits into
Result
This PR adds an Apple M5 Max-tuned Metal path and keeps the official-vector and long-context checks green. Non-M5 devices keep the existing 2048-token default.
Fresh single-run comparison against origin/main on an Apple M5 Max 128GB machine, Metal backend, ds4flash.gguf:
The main speed win is the M5 Max 4096-token prefill path. The README-shaped 2048-frontier sweep is effectively neutral in this single-run measurement, so I am not claiming a broad speedup there.
Correctness checks pass, including official logprob vectors with the new 4096-token chunk path.
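To illustrate what the chunk size changes, here is a minimal sketch of how a prompt might be split into prefill chunks. It is illustrative Python, not code from this PR: the function name prefill_chunks and the span representation are assumptions, and the real Metal path may schedule chunks differently.

```python
def prefill_chunks(n_tokens: int, chunk: int) -> list[tuple[int, int]]:
    """Split a prompt into [start, end) spans of at most `chunk` tokens.

    Hypothetical helper for illustration only; not part of this PR.
    """
    if n_tokens < 0:
        raise ValueError("n_tokens must be non-negative")
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    # Each span starts at a multiple of `chunk`; the last one may be shorter.
    return [(start, min(start + chunk, n_tokens))
            for start in range(0, n_tokens, chunk)]


# Doubling the chunk size halves the number of prefill passes
# for a prompt whose length is a multiple of both sizes.
passes_2048 = len(prefill_chunks(65536, 2048))
passes_4096 = len(prefill_chunks(65536, 4096))
```

Under these assumptions a 65,536-token prompt takes 32 passes at a 2048-token chunk and 16 at 4096. Fewer, larger passes are where a prefill gain would be expected to come from, at the cost of larger per-chunk buffers.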
What changed
DS4_METAL_PREFILL_CHUNK=2048 forces the previous M5 Max chunk size. Values above 4096 still require DS4_METAL_ALLOW_UNSAFE_PREFILL_CHUNK=1 on the M5 Max default path.
Correctness
Tested on Apple M5 Max, 128GB RAM, Metal backend, ds4flash.gguf.
Commands run:
make test covers:
- --long-context
- --tool-call-quality
- --logprob-vectors
- --metal-kernels
- --server
I also smoke-tested the server API surface after the final build:
- GET /v1/models
- POST /v1/chat/completions
- POST /v1/responses
Server startup on the M5 Max now reports:
Benchmark commands
Fresh origin/main comparison used a separate worktree at origin/main and this branch, same model/prompt, same M5 Max machine.
4096-step sweep:
README-shaped sweep:
Additional chunk-size comparison on this PR, useful for the M5 Max default choice:
Memory
The 4096 default increases Metal context-buffer allocation but keeps it modest for the tested M5 Max class machine.
From ds4-bench context buffer reporting at the README 65k allocation:
That is about +0.4 GiB. Other devices keep the old 2048 default.
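For reference, the chunk-size rules stated in this PR (4096 default on M5 Max, 2048 elsewhere, the DS4_METAL_PREFILL_CHUNK override, and the DS4_METAL_ALLOW_UNSAFE_PREFILL_CHUNK=1 gate above 4096) can be sketched roughly as below. This is illustrative Python, not the runtime's code. The function name, the is_m5_max flag, and the fallback to the default when a value is unparsable or rejected are all assumptions. How overrides are gated on non-M5 devices is not described here, so the sketch passes them through unchanged.

```python
from typing import Mapping

DEFAULT_CHUNK = 2048   # existing default, kept on non-M5 devices
M5_MAX_CHUNK = 4096    # new default on Apple M5 Max


def resolve_prefill_chunk(is_m5_max: bool, env: Mapping[str, str]) -> int:
    """Pick a prefill chunk size from the device class and environment.

    Hypothetical helper; only the two variable names and the 2048/4096
    values come from the PR description.
    """
    default = M5_MAX_CHUNK if is_m5_max else DEFAULT_CHUNK

    raw = env.get("DS4_METAL_PREFILL_CHUNK")
    if raw is None:
        return default
    try:
        requested = int(raw)
    except ValueError:
        return default  # assumption: ignore unparsable overrides
    if requested <= 0:
        return default  # assumption: ignore non-positive overrides

    allow_unsafe = env.get("DS4_METAL_ALLOW_UNSAFE_PREFILL_CHUNK") == "1"
    # Stated rule: on the M5 Max path, values above 4096 need the unsafe flag.
    if is_m5_max and requested > M5_MAX_CHUNK and not allow_unsafe:
        return default  # assumption: rejected values fall back to the default
    return requested
```

In a real process the second argument would be os.environ. Taking the mapping as a parameter keeps the sketch easy to exercise, for example resolve_prefill_chunk(True, {"DS4_METAL_PREFILL_CHUNK": "2048"}) to model forcing the previous chunk size.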
Scope notes
I inspected the M5 branch changes and did not include README-only or dataset/imatrix churn. This PR is intentionally limited to runtime Metal changes and the M5 Max prefill default.