Metal: correctness-gate M5 Max 4096 prefill (+5%) by fitchmultz · Pull Request #149 · antirez/ds4

fitchmultz · 2026-05-14T21:54:35Z

Result

This PR adds an Apple M5 Max-tuned Metal path and keeps the official-vector and long-context checks green. Non-M5 devices keep the existing 2048-token default.

Fresh single-run comparison against origin/main on an Apple M5 Max 128GB machine, Metal backend, ds4flash.gguf:

benchmark	origin/main	this PR	result
4096-step sweep, avg prefill	258.06 t/s	270.88 t/s	+5.0%
4096-step sweep, avg generation	26.82 t/s	26.60 t/s	neutral / -0.8%
README 65k sweep, avg prefill	238.30 t/s	235.15 t/s	neutral / -1.3%
README 65k sweep, avg generation	25.92 t/s	25.69 t/s	neutral / -0.9%

The main speed win is the M5 Max 4096-token prefill path. The README-shaped 2048-frontier sweep is effectively neutral in this single-run measurement, so I am not claiming a broad speedup there.
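The percentages in the table can be reproduced from the raw throughput figures. The snippet below is only an arithmetic check of the numbers quoted above; the helper name `pct_change` is illustrative and not part of the PR.

```python
# Throughput figures (tokens/second) copied from the comparison table above:
# (origin/main, this PR).
rows = {
    "4096-step sweep, avg prefill": (258.06, 270.88),
    "4096-step sweep, avg generation": (26.82, 26.60),
    "README 65k sweep, avg prefill": (238.30, 235.15),
    "README 65k sweep, avg generation": (25.92, 25.69),
}


def pct_change(base: float, new: float) -> float:
    """Relative change of `new` against `base`, in percent."""
    return (new - base) / base * 100.0


for name, (base, new) in rows.items():
    print(f"{name}: {pct_change(base, new):+.1f}%")
```

Only the first row moves by more than about one percent, which matches the claim that the gain is specific to the 4096-token prefill path.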

Correctness checks pass, including official logprob vectors with the new 4096-token chunk path.

What changed

- Adds Apple M5-gated Metal runtime fast paths:
  - simdgroup matrix matmul specialization
  - private Metal scratch buffers for GPU-only hot intermediates, keeping hazard tracking enabled
- Makes 4096-token prefill chunks the default only on Apple M5 Max Metal.
- Keeps other devices/backends on the existing 2048-token default.
- Makes the 4096-token path correctness-safe by splitting the zero-prefix first chunk at the existing 2048-token correctness boundary. This avoids selecting compressed top-k rows from future causal positions.
- Aligns server KV disk-cache boundaries to the backend prefill chunk:
  - M5 Max Metal: 4096
  - other devices/backends: 2048
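The chunking and alignment rules in this list can be modeled in a few lines. This is a minimal sketch of one reading of the description, not the code in this PR: the function names `default_prefill_chunk`, `plan_prefill`, and `cache_boundary` are hypothetical, and rounding the cache boundary down to a multiple of the chunk size is an assumption.

```python
SAFE_BOUNDARY = 2048  # the existing 2048-token correctness boundary


def default_prefill_chunk(is_m5_max_metal: bool) -> int:
    """Default chunk size: 4096 on M5 Max Metal, 2048 everywhere else."""
    return 4096 if is_m5_max_metal else 2048


def plan_prefill(n_tokens: int, chunk: int, prefix_len: int = 0):
    """Return (start, end) token spans for a prefill of n_tokens.

    Hypothetical model: when there is no cached prefix and the first chunk
    would cross SAFE_BOUNDARY, that chunk is split at the boundary.
    """
    spans = []
    pos, end = prefix_len, prefix_len + n_tokens
    while pos < end:
        stop = min(pos + chunk, end)
        if pos == 0 and stop > SAFE_BOUNDARY:
            # Zero-prefix first chunk: emit the part up to the boundary first.
            spans.append((0, SAFE_BOUNDARY))
            pos = SAFE_BOUNDARY
        spans.append((pos, stop))
        pos = stop
    return spans


def cache_boundary(pos: int, chunk: int) -> int:
    """Assumed alignment rule: round down to a multiple of the chunk size."""
    return pos - pos % chunk


print(plan_prefill(10000, default_prefill_chunk(True)))
print(cache_boundary(7000, 4096), cache_boundary(7000, 2048))
```

In this model a 10000-token prompt with no prefix is processed as 2048 + 2048 + 4096 + 1808 tokens on the 4096 path, so only the first chunk pays for the split.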

DS4_METAL_PREFILL_CHUNK=2048 forces the previous M5 Max chunk size. Values above 4096 still require DS4_METAL_ALLOW_UNSAFE_PREFILL_CHUNK=1 on the M5 Max default path.
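The override rule can be sketched as a small resolver. The two environment variable names come from the paragraph above; everything else is an assumption for illustration, including the function name `resolve_prefill_chunk` and the choice to raise an error when an oversized value is requested without the unsafe flag (the PR does not say how that case is handled).

```python
def resolve_prefill_chunk(env: dict, default: int = 4096, safe_max: int = 4096) -> int:
    """Pick the prefill chunk size for the M5 Max default path.

    Hypothetical model of the documented behavior, not the PR's code.
    """
    raw = env.get("DS4_METAL_PREFILL_CHUNK")
    if raw is None:
        return default
    requested = int(raw)
    unsafe_ok = env.get("DS4_METAL_ALLOW_UNSAFE_PREFILL_CHUNK") == "1"
    if requested > safe_max and not unsafe_ok:
        # Assumption: reject the value. The real handling may differ.
        raise ValueError(
            f"chunk {requested} > {safe_max} requires "
            "DS4_METAL_ALLOW_UNSAFE_PREFILL_CHUNK=1"
        )
    return requested


print(resolve_prefill_chunk({}))
print(resolve_prefill_chunk({"DS4_METAL_PREFILL_CHUNK": "2048"}))
```

Passing the environment as a plain dictionary keeps the sketch testable; a real caller would pass `os.environ`.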

Correctness

Tested on Apple M5 Max, 128GB RAM, Metal backend, ds4flash.gguf.

Commands run:

make clean && make
make test
make cpu
make clean && make
DS4_METAL_PREFILL_CHUNK=4096 ./ds4_test --logprob-vectors

make test covers:

--long-context
--tool-call-quality
--logprob-vectors
--metal-kernels
--server

I also smoke-tested the server API surface after the final build:

GET /v1/models
POST /v1/chat/completions
POST /v1/responses

Server startup on the M5 Max now reports:

prefill_chunk=4096
KV disk cache ... align=4096

Benchmark commands

The fresh comparison ran origin/main from a separate worktree and this branch on the same M5 Max machine, with the same model and prompt.

4096-step sweep:

./ds4-bench \
  -m ds4flash.gguf \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 4096 --ctx-max 32768 --step-incr 4096 \
  --gen-tokens 64

README-shaped sweep:

./ds4-bench \
  -m ds4flash.gguf \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 2048 --ctx-max 65536 --step-incr 2048 \
  --gen-tokens 128

Additional chunk-size comparison on this PR, useful for the M5 Max default choice:

benchmark	2048 chunk	4096 chunk	result
4096-step sweep, avg prefill	261.55 t/s	281.48 t/s	+7.6%
4096-step sweep, avg generation	24.25 t/s	25.31 t/s	+4.4%

Memory

The 4096 default increases Metal context-buffer allocation but keeps it modest for the tested M5 Max class machine.

From ds4-bench context buffer reporting at the README 65k allocation:

chunk	context buffers
2048	1311.89 MiB
4096	1740.42 MiB

That is +428.53 MiB, or about +0.4 GiB. Other devices keep the old 2048 default.
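The memory delta follows directly from the two reported buffer sizes. The snippet below is just that subtraction and unit conversion, using the figures from the table above; the variable names are illustrative.

```python
# Context-buffer sizes in MiB at the README 65k allocation, keyed by chunk size.
context_buffers_mib = {2048: 1311.89, 4096: 1740.42}

delta_mib = context_buffers_mib[4096] - context_buffers_mib[2048]
delta_gib = delta_mib / 1024  # 1 GiB = 1024 MiB

print(f"extra context buffers: {delta_mib:.2f} MiB = {delta_gib:.2f} GiB")
```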

Scope notes

I inspected the M5 branch changes and did not include README-only or dataset/imatrix churn. This PR is intentionally limited to runtime Metal changes and the M5 Max prefill default.

fitchmultz added 2 commits May 14, 2026 15:43

metal: add M5 Max runtime fast paths

0b850f1

metal: default M5 Max to safe 4096 prefill

c1ee32a

fitchmultz changed the title ~~Metal: add M5 Max fast paths and 4096 prefill default~~ Metal: speed up M5 Max prefill with correctness-gated 4096 chunks May 14, 2026

fitchmultz changed the title ~~Metal: speed up M5 Max prefill with correctness-gated 4096 chunks~~ Metal: correctness-gate M5 Max 4096 prefill (+5%) May 14, 2026

Merge branch 'antirez:main' into m5-responses

e54b952

fitchmultz wants to merge 3 commits into antirez:main from fitchmultz:m5-responses
