[MLX] Reduce physical footprint memory in RingBufferKVCache for chunked prefill by metascroy · Pull Request #20341 · pytorch/executorch

metascroy · 2026-06-17T18:20:28Z

When doing chunked prefill, the RingBufferKVCache does not need a buffer of 2x window size; window_size + max_write_length - 1 slots are enough, where max_write_length is the prefill chunk size. This PR exposes that knob and wires it to the gemma4 31b MLX export, which uses chunk_size 256, smaller than gemma4's window size (1024).

Reduces phys_footprint on a 4K export by around 0.68 GiB (from 13.84 GiB to 13.16 GiB).
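To make the slot counts concrete, here is a small arithmetic sketch for the gemma4 numbers quoted above (window size 1024, chunk size 256). The function name `ring_buffer_slots` is illustrative and not taken from the PR. The measured 0.68 GiB also depends on layer count, head dimensions, and dtype, none of which this sketch models.

```python
def ring_buffer_slots(window_size: int, max_write_len: int) -> int:
    """Slots needed so one chunked write and its attention window coexist.

    Hypothetical helper; the formula is the one stated in the PR description.
    """
    return window_size + max_write_len - 1


window_size, chunk_size = 1024, 256

old_slots = 2 * window_size                           # previous 2x sizing
new_slots = ring_buffer_slots(window_size, chunk_size)

print(old_slots, new_slots, old_slots - new_slots)    # 2048 1279 769
```

Per sliding-window cache, that is 769 fewer sequence slots along the buffer dimension.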


pytorch-bot · 2026-06-17T18:20:31Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20341

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 16 Pending, 2 Unrelated Failures, 2 Unclassified Failures

As of commit d91d110 with merge base 0eb8247:

NEW FAILURES - The following jobs have failed:

pull / test-lora-multimethod-linux / linux-job (gh)
RuntimeError: Command docker exec -t 1abe344b4150640f50724a25c4278f4d2e966ee07b69f69b8a961e99ea674711 /exec failed with exit code 1
pull / test-moshi-linux / linux-job (gh)
RuntimeError: Command docker exec -t e792694c068cf4724b37674245865bde33bfa7367318ba60fc07aaf7ec93745f /exec failed with exit code 1
pull / test-voxtral-realtime-xnnpack-linux / linux-job (gh)
RuntimeError: Command docker exec -t b3b6c956b18f40463ebf19aef9334000fe03653ab5ddd5da6aab3dfe13d6845d /exec failed with exit code 1

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

Build Aarch64 Linux Wheels / pytorch/executorch / build-wheel-py3_10-cpu-aarch64 (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
/__w/executorch/executorch/pytorch/executorch/backends/apple/coreml/runtime/inmemoryfs/inmemory_filesystem.cpp:722:48: error: ‘inmemoryfs::InMemoryFileSystem::InMemoryNode::Kind’ has not been declared
Build Aarch64 Linux Wheels / pytorch/executorch / upload / upload-wheel-py3_10-cpu-aarch64 (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
Unable to download artifact(s): Artifact not found for name: pytorch_executorch__3.10_cpu_aarch64

FLAKY - The following job failed but was likely due to flakiness present on trunk:

pull / test-llama-runner-qnn-linux (fp32, qnn_16a16w, qnn) / linux-job (gh) (detected as infra flaky with no log or failing log classifier)

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / test-arm-backend-no-driver (test_pytest_models_tosa) / linux-job (gh) (trunk failure)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

metascroy · 2026-06-17T18:20:41Z

@claude review this code

github-actions · 2026-06-17T18:21:26Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

claude · 2026-06-17T18:22:08Z

Claude finished @metascroy's task in 6m 39s

Code Review

I reviewed the diff against origin/main. This is a clean, well-reasoned change. The core idea — that a sliding-window ring buffer needs window_size + max_write_len - 1 slots rather than 2 × window_size — is correct, and the implementation wires it through cleanly.

Tasks

Read PR diff and surrounding code
Verify the buffer-sizing math
Check downstream consumers of buffer_size / mask logic
Review test coverage
Post review

Correctness ✅

The buffer-size derivation checks out. During a single multi-token write of seq_len tokens at start_pos, with w = window_size:

The earliest query (start_pos) needs positions back to start_pos − w + 1.
The latest query (start_pos + seq_len − 1) needs the freshly written tail.

The union spans (start_pos + seq_len − 1) − (start_pos − w + 1) + 1 = seq_len + w − 1 distinct positions that must coexist in the ring, so buffer_size ≥ window_size + max_write_len − 1 is exactly right. The create_sliding_window_mask reconstruction (backends/mlx/llm/cache.py:338-366) is already written generically in terms of buffer_size, so it stays correct for any buffer_size and needs no change. Good.
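The derivation above can be sanity-checked by brute force. The sketch below assumes slots are assigned as `position % buffer_size`, which is the usual ring-buffer layout but is an assumption here, not something confirmed from cache.py. It searches for the smallest buffer in which every position needed by one write lands in a distinct slot.

```python
def min_collision_free_buffer(window_size: int, seq_len: int, start_pos: int) -> int:
    """Smallest buffer size with no slot collisions among the needed positions.

    Hypothetical helper: assumes slot index = position % buffer_size.
    """
    lo = max(0, start_pos - window_size + 1)   # oldest position the first query needs
    hi = start_pos + seq_len - 1               # newest position written
    needed = range(lo, hi + 1)
    size = 1
    # Grow the buffer until every needed position maps to its own slot.
    while len({pos % size for pos in needed}) != len(needed):
        size += 1
    return size


# Once start_pos is past the first window, the answer is seq_len + window_size - 1.
for w, s in [(4, 1), (8, 3), (16, 16)]:
    assert min_collision_free_buffer(w, s, start_pos=100) == w + s - 1
```

Near the start of the sequence (start_pos < window_size − 1) fewer positions exist, so a smaller buffer would do; the bound matters for steady-state writes.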

The assert max_write_len <= window_size guard (cache.py:276-278) is the right invariant: if a write exceeded the window, slots within a single write could collide.

Minor: docstring vs. actual default

The default is now 2 * window_size − 1, not 2 * window_size. Both cache.py:235 ("a ~2× buffer") and mlx_source_transformations.py:184 ("defaults to the full window (the original 2× buffer)") describe the default as the original 2× buffer, but it's actually one slot smaller than before. The new value is mathematically correct (and a tiny bit leaner), so this is purely a doc-accuracy nit — consider saying "≈2× (2·window_size − 1)" so the −1 isn't surprising to a future reader comparing against the old 2 * max_context_length.
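A minimal sketch of how the default could resolve, assuming (as the review describes) that an omitted max_write_len falls back to window_size and that the guard rejects writes longer than the window. `resolve_buffer_size` is a hypothetical name; the actual constructor in cache.py may be structured differently.

```python
from typing import Optional


def resolve_buffer_size(window_size: int, max_write_len: Optional[int] = None) -> int:
    """Hypothetical helper mirroring the sizing rule discussed in the review."""
    if max_write_len is None:
        # Assumed default: a write may be as long as the full window.
        max_write_len = window_size
    assert 1 <= max_write_len <= window_size, (
        f"max_write_len ({max_write_len}) must be in [1, window_size={window_size}]"
    )
    return window_size + max_write_len - 1


print(resolve_buffer_size(1024))        # 2047, one slot short of 2 * 1024
print(resolve_buffer_size(1024, 256))   # 1279
```

This makes the off-by-one explicit: the default is 2 * window_size − 1 rather than 2 * window_size.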

Suggestion: test coverage for the new path

RingBufferKVCacheTest (backends/mlx/test/test_ops.py:2732) only exercises the default buffer size; RingBufferKVCacheModel doesn't plumb max_write_len through. The reduced-buffer case is the one most likely to regress (a wrapping write into the tighter buffer plus mask reconstruction), and it's currently untested. Worth adding a config where max_write_len < window_size and seq_step triggers a ring wrap, asserting numerics match a reference.
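As a rough illustration of what such a test could check, here is a pure-Python toy model, not the MLX test harness. `ToyRingCache` and `check_chunked_prefill` are hypothetical names, and slots are assumed to be indexed by `position % buffer_size`. Instead of comparing attention numerics, it compares the set of positions each query can see against the reference sliding window while the ring wraps several times.

```python
from typing import List, Optional


class ToyRingCache:
    """Tracks which absolute token position occupies each ring slot."""

    def __init__(self, window_size: int, max_write_len: int,
                 buffer_size: Optional[int] = None) -> None:
        assert max_write_len <= window_size
        self.window_size = window_size
        # Reduced size under test; an explicit override allows a negative check.
        self.buffer_size = buffer_size or (window_size + max_write_len - 1)
        self.slots: List[Optional[int]] = [None] * self.buffer_size

    def write(self, start_pos: int, seq_len: int) -> None:
        for pos in range(start_pos, start_pos + seq_len):
            self.slots[pos % self.buffer_size] = pos

    def visible(self, query_pos: int) -> List[int]:
        # Causal sliding window: positions in (query_pos - window, query_pos].
        lo = query_pos - self.window_size + 1
        return sorted(p for p in self.slots
                      if p is not None and lo <= p <= query_pos)


def check_chunked_prefill(window_size: int, max_write_len: int, total_len: int,
                          buffer_size: Optional[int] = None) -> bool:
    cache = ToyRingCache(window_size, max_write_len, buffer_size)
    for start in range(0, total_len, max_write_len):
        seq_len = min(max_write_len, total_len - start)
        cache.write(start, seq_len)
        for q in range(start, start + seq_len):
            expected = list(range(max(0, q - window_size + 1), q + 1))
            if cache.visible(q) != expected:
                return False
    return True


print(check_chunked_prefill(8, 3, 40))                 # True: buffer of 10 slots
print(check_chunked_prefill(8, 3, 40, buffer_size=9))  # False: one slot too few
```

The second call shows why the bound is tight: with one slot fewer, the newest token in a chunk overwrites the oldest position that the chunk's first query still needs.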

Observation: generic HF path not wired

backends/mlx/llm/source_transformation.py:206 constructs RingBufferKVCache without max_write_len, so the generic HF sliding-window export still gets the full 2× buffer. That's consistent with the PR scope (gemma4 only), but the same memory win is available there if the export-time max prefill length is known — possible follow-up.

Style

max_write_len: int | None = None and the threading through both functions match the surrounding code well. Comments are appropriately scoped to the non-obvious invariant. No issues.

Nothing blocking — the math is sound and the change does what it claims. The only real ask is the missing test for the reduced-buffer path; the docstring wording is a nit.

mergennachin · 2026-06-17T19:07:44Z

Can't a similar optimization apply to CUDA too?

cc @Gasoonjia

Gasoonjia · 2026-06-17T23:41:55Z

@mergennachin I think it can, but because the ring KV cache implementation is different there, we may need a different approach. Not a blocker though.

uddeshsingh and others added 12 commits June 12, 2026 00:34

Add fused Q4_K Metal kernels for GGUF lowering (#20172)

eda40b8

Replace the export-time GGUF-to-MLX qparam repack path with fused Metal kernels

Guard Q4_K GGUF lowering behind ET_MLX_EMIT_DIRECT_GGUF

49ac1d2

Keep the legacy MLX-native repack path available when the env var is set to 0, per maintainer request on #20172.

Merge branch 'pytorch:main' into uddeshsingh/q4k-fused-kernels

1f7be36

Merge branch 'pytorch:main' into uddeshsingh/q4k-fused-kernels

bb6b564

Extract emit_if_else/emit_sub_int/emit_ceil_div helpers, fix output d…

c23c9e4

…type handling, add legacy-path test coverage, and harden the embedding kernel.

Move Q4_K env-var dispatch into emit_linear/emit_embedding so pattern…

bbfe5d6

…s stays unchanged.

Move Q4_K embedding env-var dispatch fully into emit_embedding.

e6ebd60

Simplify GGUF linear emit via emit_if_else constant folding

b587800

up

9a6ec9a

up

0c8bbdd

up

db628f8

Merge branch 'main' into reduce-mem-sliding-kv

a096d94

meta-cla Bot added the CLA Signed label Jun 17, 2026

metascroy temporarily deployed to cadence June 17, 2026 18:20 — with GitHub Actions Inactive

metascroy requested review from digantdesai and mergennachin June 17, 2026 18:20

mergennachin requested a review from Gasoonjia June 17, 2026 19:07

up

934dca5

metascroy temporarily deployed to cadence June 17, 2026 20:49 — with GitHub Actions Inactive

metascroy temporarily deployed to cadence June 17, 2026 22:06 — with GitHub Actions Inactive

up

d91d110

metascroy force-pushed the reduce-mem-sliding-kv branch from 94a5797 to d91d110 June 17, 2026 22:32

metascroy temporarily deployed to cadence June 17, 2026 22:32 — with GitHub Actions Inactive

metascroy added the ciflow/mlx label Jun 17, 2026

Gasoonjia approved these changes Jun 17, 2026


metascroy merged commit 66feb0a into main Jun 17, 2026
238 of 254 checks passed

metascroy deleted the reduce-mem-sliding-kv branch June 17, 2026 23:42

metascroy merged 14 commits into main from reduce-mem-sliding-kv

4 participants
