Add TurboQuant PolarQuant KV cache compression (turbo2/turbo3/turbo4)#332
ShrekDino wants to merge 1 commit into
Implements TurboQuant+ KV cache compression for MLX:
- PolarQuant: norm extraction + WHT rotation + Lloyd-Max scalar quantization
- Fast Walsh-Hadamard Transform for O(d log d) rotation
- Precomputed Lloyd-Max centroids for N(0, 1/d) distribution
- TurboQuantKVCache class compatible with MLX's KVCache interface

The cache stores quantized indices (uint8) + norms (fp32) instead of full-precision keys/values, achieving:
- turbo2: 2-bit, ~6.4x compression
- turbo3: 3-bit, ~4.6x compression
- turbo4: 4-bit, ~3.8x compression

See github.com/TheTom/turboquant_plus for the reference implementation.
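The description does not state the baseline precision, the head dimension, or whether the indices are bit-packed, and the ratio depends on all three. The sketch below makes that arithmetic explicit. The function name `compression_ratio`, its parameters, and the storage layout (one norm per vector, a fixed number of stored bits per index) are assumptions for illustration, not code from this PR.

```python
def compression_ratio(index_bits: int, head_dim: int,
                      norm_bits: int = 32, baseline_bits: int = 16) -> float:
    """Baseline storage divided by quantized storage for one key/value vector.

    Assumed layout (hypothetical): `head_dim` indices of `index_bits` bits each,
    plus a single norm of `norm_bits` bits. The baseline stores `head_dim`
    values of `baseline_bits` bits each.
    """
    if index_bits <= 0 or head_dim <= 0:
        raise ValueError("index_bits and head_dim must be positive")
    quantized_bits = head_dim * index_bits + norm_bits
    return (head_dim * baseline_bits) / quantized_bits


if __name__ == "__main__":
    for name, bits in (("turbo2", 2), ("turbo3", 3), ("turbo4", 4)):
        packed = compression_ratio(bits, head_dim=128)
        # One index per uint8 byte, i.e. no bit-packing.
        unpacked = compression_ratio(8, head_dim=128)
        print(f"{name}: bit-packed {packed:.2f}x, unpacked uint8 {unpacked:.2f}x")
```

Under these assumptions, the memory saved depends heavily on whether the uint8 indices are packed down to 2, 3, or 4 bits each, so it may be worth stating the packing scheme alongside the quoted ratios.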
I have read the CLA Document and I hereby sign the CLA
ShrekDino seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 30ece36b94
if live_cache_size is None:
self._live_tokens = None
self._live_cache = self._make_cache()
self._live_tokens = None
Restore cancellation branch indentation
Importing cache_wrapper.py now fails with IndentationError: expected an indented block after 'if' statement on line 237 because the body of if live_cache_size is None: was dedented out of _prefill_cache. This blocks the engine from starting for any configuration, before TurboQuant is even selected.
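A minimal reproduction of this failure mode, assuming nothing about the real file beyond the dedented `if` body. The class name `CacheWrapperSketch` and the method body are hypothetical stand-ins. Compiling the source without importing it is enough to surface the error.

```python
import textwrap

# Hypothetical stand-in for the affected method; only the indentation matters.
BROKEN = textwrap.dedent('''\
    class CacheWrapperSketch:
        def _prefill_cache(self, live_cache_size=None):
            if live_cache_size is None:
            self._live_tokens = None
''')

FIXED = textwrap.dedent('''\
    class CacheWrapperSketch:
        def _prefill_cache(self, live_cache_size=None):
            if live_cache_size is None:
                self._live_tokens = None
''')


def compiles(source: str) -> bool:
    """Return True if `source` is syntactically valid Python."""
    try:
        compile(source, "cache_wrapper_sketch.py", "exec")
        return True
    except IndentationError as exc:
        print(f"IndentationError: {exc.msg} (line {exc.lineno})")
        return False


if __name__ == "__main__":
    print("broken compiles:", compiles(BROKEN))
    print("fixed compiles:", compiles(FIXED))
```

The same check can be run against the real file with `python -m py_compile mlx_engine/cache_wrapper.py`, which reports syntax and indentation errors without starting the engine.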
Summary
Add TurboQuant+ PolarQuant KV cache compression support to the MLX engine backend. This implements near-lossless KV cache compression using PolarQuant (AISTATS 2026, arXiv:2502.02617) with Walsh-Hadamard rotation and Lloyd-Max optimal scalar quantization.
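The pipeline named in the summary (norm extraction, sign-flipped Walsh-Hadamard rotation, Lloyd-Max scalar quantization) can be sketched roughly as follows. This is an illustrative NumPy version, not the PR's MLX code. The function names echo those listed in the PR, but every signature here is an assumption, and the centroids are estimated from random samples rather than taken from a precomputed table.

```python
import numpy as np


def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal fast Walsh-Hadamard transform along the last axis, O(d log d)."""
    d = x.shape[-1]
    if d == 0 or d & (d - 1):
        raise ValueError("last dimension must be a power of two")
    lead = x.shape[:-1]
    y = x.astype(np.float32)
    h = 1
    while h < d:
        # Butterfly step: combine the two halves of every block of size 2h.
        y = y.reshape(*lead, d // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        y = np.stack((a + b, a - b), axis=-2).reshape(*lead, d)
        h *= 2
    return y / np.sqrt(d)


def lloyd_max_centroids(bits: int, d: int, iters: int = 30,
                        n: int = 100_000, seed: int = 0) -> np.ndarray:
    """Estimate Lloyd-Max centroids for N(0, 1/d) from samples (assumed approach)."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(0.0, 1.0 / np.sqrt(d), size=n)
    levels = 2 ** bits
    c = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        edges = (c[:-1] + c[1:]) / 2          # nearest-centroid decision boundaries
        idx = np.searchsorted(edges, samples)
        for k in range(levels):
            mask = idx == k
            if mask.any():                     # keep the old centroid if a bin is empty
                c[k] = samples[mask].mean()
    return np.sort(c).astype(np.float32)


def quantize_polar(x, signs, centroids):
    """Return (uint8 indices, fp32 norms) for vectors along the last axis."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True).astype(np.float32)
    unit = x / np.maximum(norms, 1e-12)
    rotated = fwht(unit * signs)               # coordinates are roughly N(0, 1/d)
    edges = (centroids[:-1] + centroids[1:]) / 2
    return np.searchsorted(edges, rotated).astype(np.uint8), norms


def dequantize_polar(idx, norms, signs, centroids):
    """Invert quantize_polar; the orthonormal WHT is its own inverse."""
    return fwht(centroids[idx]) * signs * norms


if __name__ == "__main__":
    d, bits = 128, 3
    rng = np.random.default_rng(1)
    signs = rng.choice(np.array([-1.0, 1.0], dtype=np.float32), size=d)
    x = rng.standard_normal((4, d)).astype(np.float32)
    cents = lloyd_max_centroids(bits, d)
    idx, norms = quantize_polar(x, signs, cents)
    x_hat = dequantize_polar(idx, norms, signs, cents)
    print("relative error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```

The sketch keeps one index per uint8 for clarity. A real cache would likely need to bit-pack the indices to realize the memory savings.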
Changes
New: mlx_engine/utils/turboquant.py
- make_wh_rotation() / apply_wh_rotation(): WHT rotation with random sign flips
- lloyd_max_centroids(): precomputed optimal centroids for N(0, 1/d) distribution
- quantize_polar() / dequantize_polar(): PolarQuant compression/decompression
- TurboQuantKVCache: drop-in compatible KVCache with inline PolarQuant compression

Modified: mlx_engine/utils/kv_cache_quantization.py
- get_kv_cache_quantization_params() now accepts string turbo modes
- VALID_TURBO_MODES = ("turbo2", "turbo3", "turbo4")
- When kv_bits is a string like "turbo3", it is returned as-is for the pipeline to handle

Modified: mlx_engine/cache_wrapper.py
- In _prefill_cache(), when kv_bits is a turbo string, apply PolarQuant instead of MLX native quantization
- _apply_turboquant_to_cache() replaces cache entries with quantized versions

Compression Ratios
- turbo2: 2-bit PolarQuant, ~6.4x compression
- turbo3: 3-bit PolarQuant, ~4.6x compression
- turbo4: 4-bit PolarQuant, ~3.8x compression

References
Testing
Requires Apple Silicon hardware with MLX. Test with kv_bits="turbo3" in the model load config.