[MLX] Reduce physical footprint memory in RingBufferKVCache for chunked prefill #20341
Replace the export-time GGUF-to-MLX qparam repack path with fused Metal kernels
Keep the legacy MLX-native repack path available when the env var is set to 0, per maintainer request on #20172.
…type handling, add legacy-path test coverage, and harden the embedding kernel.
…s stays unchanged.
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20341
Note: Links to docs will display an error until the docs builds have been completed.
❌ 3 New Failures, 16 Pending, 2 Unrelated Failures, 2 Unclassified Failures
As of commit d91d110 with merge base 0eb8247:
NEW FAILURES - The following jobs have failed:
UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
BROKEN TRUNK - The following job failed but was present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@claude review this code
Claude finished @metascroy's task in 6m 39s
Code Review
I reviewed the diff against
Tasks
Correctness ✅
The buffer-size derivation checks out. During a single multi-token write of
The union spans The
Minor: docstring vs. actual default
The default is now
Suggestion: test coverage for the new path
Observation: generic HF path not wired
Style
Nothing blocking: the math is sound and the change does what it claims. The only real ask is the missing test for the reduced-buffer path; the docstring wording is a nit.
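The buffer-size argument can also be checked by brute force. The sketch below is a toy model, not code from this PR: it assumes the token at absolute position `p` is stored in slot `p % buffer_size`, that a whole chunk is written before attention runs, and that each query attends to the `window` most recent positions including its own. The names `required_positions` and `ring_is_safe` are hypothetical.

```python
def required_positions(start: int, length: int, window: int) -> range:
    """Positions that queries start..start+length-1 may attend to.

    Assumes each query q sees positions q-window+1..q (clamped at 0),
    so the union over one chunk spans window + length - 1 positions.
    """
    first = max(0, start - window + 1)
    return range(first, start + length)


def ring_is_safe(buffer_size: int, window: int, chunk: int, total_tokens: int) -> bool:
    """Simulate chunked prefill and report whether any needed entry was overwritten."""
    slots = [None] * buffer_size  # slot index -> absolute position stored there
    for start in range(0, total_tokens, chunk):
        length = min(chunk, total_tokens - start)
        # Write the whole chunk first (assumed write-then-attend ordering).
        for pos in range(start, start + length):
            slots[pos % buffer_size] = pos
        # Every position needed by this chunk's queries must still be resident.
        for pos in required_positions(start, length, window):
            if slots[pos % buffer_size] != pos:
                return False
    return True


if __name__ == "__main__":
    window, chunk = 1024, 256  # values quoted in the PR description
    for size in (window + chunk - 2, window + chunk - 1, 2 * window):
        print(size, ring_is_safe(size, window, chunk, total_tokens=4096))
```

Under these assumptions, `window + chunk - 1` is the smallest size that passes; one slot fewer lets the last token of a chunk overwrite the oldest entry that the first query of the same chunk still needs.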
Can't a similar optimization apply to CUDA too? cc @Gasoonjia
94a5797 to d91d110
@mergennachin I think it can, but due to the different ring KV cache impl we may need a different approach. Not a blocker though.
When doing chunked prefill, the RingBufferKVCache does not need 2x the window size; it only needs window_size + max_write_length - 1, where max_write_length is the prefill chunk size. This PR exposes that knob and wires it to the gemma4 31b MLX export, which uses a chunk_size of 256, smaller than gemma4's window size (1024).
Reduces phys_footprint on a 4K export by about 0.68 GiB (from 13.84 GiB to 13.16 GiB).
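For a sense of scale, the sizing rule can be worked through with the numbers quoted above. This is only an illustrative sketch: `min_ring_buffer_size` is a hypothetical helper, not the API added by this PR, and it counts cache positions per sliding-window layer rather than bytes.

```python
def min_ring_buffer_size(window_size: int, max_write_length: int) -> int:
    """Smallest ring size (in positions) for writes of up to max_write_length tokens.

    Hypothetical helper that restates the window_size + max_write_length - 1 rule.
    """
    if window_size < 1 or max_write_length < 1:
        raise ValueError("window_size and max_write_length must be positive")
    return window_size + max_write_length - 1


window_size = 1024  # gemma4 sliding-window size, from the description
chunk_size = 256    # prefill chunk size used by the MLX export

reduced = min_ring_buffer_size(window_size, chunk_size)  # 1279 positions
legacy = 2 * window_size                                 # 2048 positions
saved = legacy - reduced                                 # 769 positions

print(f"reduced={reduced} legacy={legacy} saved={saved} ({saved / legacy:.1%})")
```

With single-token decode (`max_write_length` of 1) the same rule gives exactly `window_size`, so the extra `max_write_length - 1` positions exist only to cover multi-token writes.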