Skip to content

Metal: correctness-gate M5 Max 4096 prefill (+5%)#149

Open
fitchmultz wants to merge 3 commits into
antirez:mainfrom
fitchmultz:m5-responses
Open

Metal: correctness-gate M5 Max 4096 prefill (+5%)#149
fitchmultz wants to merge 3 commits into
antirez:mainfrom
fitchmultz:m5-responses

Conversation

@fitchmultz
Copy link
Copy Markdown

@fitchmultz fitchmultz commented May 14, 2026

Result

This PR adds an Apple M5 Max-tuned Metal path and keeps the official-vector and long-context checks green. Non-M5 devices keep the existing 2048-token default.

Fresh single-run comparison against origin/main on an Apple M5 Max 128GB machine, Metal backend, ds4flash.gguf:

benchmark origin/main this PR result
4096-step sweep, avg prefill 258.06 t/s 270.88 t/s +5.0%
4096-step sweep, avg generation 26.82 t/s 26.60 t/s neutral / -0.8%
README 65k sweep, avg prefill 238.30 t/s 235.15 t/s neutral / -1.3%
README 65k sweep, avg generation 25.92 t/s 25.69 t/s neutral / -0.9%

The main speed win is the M5 Max 4096-token prefill path. The README-shaped 2048-frontier sweep is effectively neutral in this single-run measurement, so I am not claiming a broad speedup there.

Correctness checks pass, including official logprob vectors with the new 4096-token chunk path.

What changed

  • Adds Apple M5-gated Metal runtime fast paths:
    • simdgroup matrix matmul specialization
    • private Metal scratch buffers for GPU-only hot intermediates, keeping hazard tracking enabled
  • Makes 4096-token prefill chunks the default only on Apple M5 Max Metal.
  • Keeps other devices/backends on the existing 2048-token default.
  • Makes the 4096-token path correctness-safe by splitting the zero-prefix first chunk at the existing 2048-token correctness boundary. This avoids selecting compressed top-k rows from future causal positions.
  • Aligns server KV disk-cache boundaries to the backend prefill chunk:
    • M5 Max Metal: 4096
    • other devices/backends: 2048

DS4_METAL_PREFILL_CHUNK=2048 forces the previous M5 Max chunk size. Values above 4096 still require DS4_METAL_ALLOW_UNSAFE_PREFILL_CHUNK=1 on the M5 Max default path.

Correctness

Tested on Apple M5 Max, 128GB RAM, Metal backend, ds4flash.gguf.

Commands run:

make clean && make
make test
make cpu
make clean && make
DS4_METAL_PREFILL_CHUNK=4096 ./ds4_test --logprob-vectors

make test covers:

  • --long-context
  • --tool-call-quality
  • --logprob-vectors
  • --metal-kernels
  • --server

I also smoked the server API surface after the final build:

  • GET /v1/models
  • POST /v1/chat/completions
  • POST /v1/responses

Server startup on the M5 Max now reports:

prefill_chunk=4096
KV disk cache ... align=4096

Benchmark commands

Fresh origin/main comparison used a separate worktree at origin/main and this branch, same model/prompt, same M5 Max machine.

4096-step sweep:

./ds4-bench \
  -m ds4flash.gguf \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 4096 --ctx-max 32768 --step-incr 4096 \
  --gen-tokens 64

README-shaped sweep:

./ds4-bench \
  -m ds4flash.gguf \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 2048 --ctx-max 65536 --step-incr 2048 \
  --gen-tokens 128

Additional chunk-size comparison on this PR, useful for the M5 Max default choice:

benchmark 2048 chunk 4096 chunk result
4096-step sweep, avg prefill 261.55 t/s 281.48 t/s +7.6%
4096-step sweep, avg generation 24.25 t/s 25.31 t/s +4.4%

Memory

The 4096 default increases Metal context-buffer allocation but keeps it modest for the tested M5 Max class machine.

From ds4-bench context buffer reporting at the README 65k allocation:

chunk context buffers
2048 1311.89 MiB
4096 1740.42 MiB

That is about +0.4 GiB. Other devices keep the old 2048 default.

Scope notes

I inspected the M5 branch changes and did not include README-only or dataset/imatrix churn. This PR is intentionally limited to runtime Metal changes and the M5 Max prefill default.

@fitchmultz fitchmultz changed the title Metal: add M5 Max fast paths and 4096 prefill default Metal: speed up M5 Max prefill with correctness-gated 4096 chunks May 14, 2026
@fitchmultz fitchmultz changed the title Metal: speed up M5 Max prefill with correctness-gated 4096 chunks Metal: correctness-gate M5 Max 4096 prefill (+5%) May 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant