Add TurboQuant PolarQuant KV cache compression (turbo2/turbo3/turbo4) by ShrekDino · Pull Request #332 · lmstudio-ai/mlx-engine

ShrekDino · 2026-06-01T07:04:49Z

Summary

Add TurboQuant+ PolarQuant KV cache compression support to the MLX engine backend. This implements near-lossless KV cache compression using PolarQuant (AISTATS 2026, arXiv:2502.02617) with Walsh-Hadamard rotation and Lloyd-Max optimal scalar quantization.

Changes

New: `mlx_engine/utils/turboquant.py`

- Fast Walsh-Hadamard Transform implementation using MLX ops (O(d log d))
- make_wh_rotation() / apply_wh_rotation(): WHT rotation with random sign flips
- lloyd_max_centroids(): precomputed optimal centroids for N(0, 1/d) distribution
- quantize_polar() / dequantize_polar(): PolarQuant compression/decompression
- TurboQuantKVCache: drop-in compatible KVCache with inline PolarQuant compression
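The pipeline listed above (norm extraction, sign-flipped Walsh-Hadamard rotation, Lloyd-Max scalar quantization) can be sketched in plain Python. This is only an illustration under stated assumptions: it uses the standard library rather than MLX ops, every function name here is hypothetical, and it does not reproduce the contents of `turboquant.py`.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(x)
    h = 1
    while h < n:  # log2(n) butterfly stages, n/2 butterflies each
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def make_signs(d, seed=0):
    """Random +/-1 sign flips, fixed by a seed so the rotation is reproducible."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(d)]


def rotate(x, signs):
    return fwht([s * v for s, v in zip(signs, x)])


def unrotate(y, signs):
    # The orthonormal WHT is its own inverse, so apply it again and undo the signs.
    return [s * v for s, v in zip(signs, fwht(y))]


def lloyd_max(bits, d, iters=200):
    """Lloyd-Max centroids for N(0, 1/d), found by fixed-point iteration."""
    k = 2 ** bits
    pdf = lambda t: math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
    cdf = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    cents = [-2.0 + 4.0 * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        mids = [(a + b) / 2 for a, b in zip(cents, cents[1:])]
        edges = [-math.inf] + mids + [math.inf]
        # Conditional mean of a standard normal on each interval [lo, hi].
        cents = [(pdf(lo) - pdf(hi)) / (cdf(hi) - cdf(lo))
                 for lo, hi in zip(edges, edges[1:])]
    return [c / math.sqrt(d) for c in cents]


def quantize(x, signs, cents):
    """Return (norm, indices): the vector norm plus one centroid index per coordinate."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        return 0.0, [0] * len(x)
    rot = rotate([v / norm for v in x], signs)
    idx = [min(range(len(cents)), key=lambda j: abs(r - cents[j])) for r in rot]
    return norm, idx


def dequantize(norm, idx, signs, cents):
    return [norm * v for v in unrotate([cents[j] for j in idx], signs)]


# Usage: compress and reconstruct one 64-dimensional vector at 3 bits per coordinate.
d = 64
signs = make_signs(d, seed=7)
cents = lloyd_max(3, d)
vec = [random.Random(1).gauss(0.0, 1.0) for _ in range(d)]
norm, idx = quantize(vec, signs, cents)
approx = dequantize(norm, idx, signs, cents)
```

The rotation spreads the energy of a unit vector roughly evenly across coordinates, which is why a single scalar codebook fitted to N(0, 1/d) can be shared by all of them.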

Modified: `mlx_engine/utils/kv_cache_quantization.py`

- get_kv_cache_quantization_params() now accepts string turbo modes
- New VALID_TURBO_MODES = ("turbo2", "turbo3", "turbo4")
- When kv_bits is a string like "turbo3", returns it as-is for the pipeline to handle
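One way to picture the string handling described above is a small classifier that separates turbo modes from native integer bit widths. This is a hedged sketch: `resolve_kv_bits` and its return shape are hypothetical and are not the signature of `get_kv_cache_quantization_params()`; only the `VALID_TURBO_MODES` tuple is taken from the PR description.

```python
VALID_TURBO_MODES = ("turbo2", "turbo3", "turbo4")  # as listed in the PR description


def resolve_kv_bits(kv_bits):
    """Hypothetical helper: classify kv_bits as unset, a native width, or a turbo mode."""
    if kv_bits is None:
        return ("none", None)
    if isinstance(kv_bits, str):
        if kv_bits not in VALID_TURBO_MODES:
            raise ValueError(
                f"unknown turbo mode {kv_bits!r}; expected one of {VALID_TURBO_MODES}"
            )
        # The trailing digit of the mode name is the per-coordinate bit width.
        return ("turbo", int(kv_bits[len("turbo"):]))
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(kv_bits, bool) or not isinstance(kv_bits, int):
        raise TypeError(f"kv_bits must be None, an int, or a turbo mode string, got {kv_bits!r}")
    return ("native", kv_bits)


# Usage: a caller can branch on the first element to pick the quantization path.
kind, bits = resolve_kv_bits("turbo3")
```

Rejecting unknown strings at this point keeps a typo such as "turbo5" from reaching the cache code, where the failure would be harder to trace.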

Modified: `mlx_engine/cache_wrapper.py`

- Import turboquant module
- In _prefill_cache(), when kv_bits is a turbo string, apply PolarQuant instead of MLX native quantization
- Helper _apply_turboquant_to_cache() replaces cache entries with quantized versions

Compression Ratios

- turbo2: 2-bit PolarQuant, ~6.4x compression
- turbo3: 3-bit PolarQuant, ~4.6x compression
- turbo4: 4-bit PolarQuant, ~3.8x compression
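The arithmetic behind ratios like these can be sketched as follows, assuming a 16-bit baseline per cached element. The block layout used in the example (one 16-bit norm shared by 32 elements) is a guess that happens to reproduce the quoted turbo2 and turbo3 figures; the PR text does not state the layout, and both helper names are hypothetical.

```python
def effective_bits(index_bits, norm_bits, block_size):
    """Storage cost per element: the index plus the norm amortized over its block."""
    return index_bits + norm_bits / block_size


def compression_ratio(bits_per_element, baseline_bits=16.0):
    """Ratio of baseline storage to compressed storage per element."""
    return baseline_bits / bits_per_element


# Assumed layout (not from the PR): 16-bit norm per block of 32 elements.
turbo2 = compression_ratio(effective_bits(2, 16, 32))  # 16 / 2.5
turbo3 = compression_ratio(effective_bits(3, 16, 32))  # 16 / 3.5

# For comparison: one full byte per index with a 32-bit norm per 128 elements.
unpacked = compression_ratio(effective_bits(8, 32, 128))  # 16 / 8.25
```

Under these assumptions the quoted ratios depend on indices being bit-packed; storing each index in its own byte would cap the ratio below 2x against a 16-bit baseline.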

References

- TurboQuant paper: arXiv:2504.19874 (ICLR 2026)
- PolarQuant paper: arXiv:2502.02617 (AISTATS 2026)
- Reference C++/Metal implementation: https://github.com/TheTom/turboquant_plus
- Original Python research impl: https://github.com/0xSero/turboquant

Testing

Requires Apple Silicon hardware with MLX. Test with kv_bits="turbo3" in the model load config.

Implements TurboQuant+ KV cache compression for MLX:
- PolarQuant: norm extraction + WHT rotation + Lloyd-Max scalar quantization
- Fast Walsh-Hadamard Transform for O(d log d) rotation
- Precomputed Lloyd-Max centroids for N(0, 1/d) distribution
- TurboQuantKVCache class compatible with MLX's KVCache interface

The cache stores quantized indices (uint8) + norms (fp32) instead of full-precision keys/values, achieving:
- turbo2: 2-bit, ~6.4x compression
- turbo3: 3-bit, ~4.6x compression
- turbo4: 4-bit, ~3.8x compression

See github.com/TheTom/turboquant_plus for the reference implementation.

github-actions · 2026-06-01T07:05:03Z

Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by posting a Pull Request comment in the format below.

I have read the CLA Document and I hereby sign the CLA

ShrekDino does not appear to be a GitHub user. You need a GitHub account to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 30ece36b94

chatgpt-codex-connector · 2026-06-01T07:06:14Z

                if live_cache_size is None:
-                    self._live_tokens = None
-                    self._live_cache = self._make_cache()
+        self._live_tokens = None


Restore cancellation branch indentation

Importing cache_wrapper.py now fails with IndentationError: expected an indented block after 'if' statement on line 237 because the body of if live_cache_size is None: was dedented out of _prefill_cache. This blocks the engine from starting for any configuration, before TurboQuant is even selected.
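The failure mode described in this review comment can be reproduced in isolation with Python's built-in `compile()`. The snippet below is a minimal sketch: the class name and the two source strings are hypothetical stand-ins loosely modeled on the diff hunk, not the actual contents of `cache_wrapper.py`.

```python
# Body of the `if` dedented below the `if` itself, as in the reviewed hunk.
BROKEN = (
    "class CacheWrapperSketch:\n"
    "    def _prefill_cache(self, live_cache_size):\n"
    "        if live_cache_size is None:\n"
    "    self._live_tokens = None\n"
)

# Same code with the branch body indented under the `if`.
FIXED = (
    "class CacheWrapperSketch:\n"
    "    def _prefill_cache(self, live_cache_size):\n"
    "        if live_cache_size is None:\n"
    "            self._live_tokens = None\n"
    "            self._live_cache = self._make_cache()\n"
)


def indentation_error(source):
    """Return the IndentationError raised when compiling source, or None if it compiles."""
    try:
        compile(source, "<sketch>", "exec")
    except IndentationError as exc:
        return exc
    return None


broken_result = indentation_error(BROKEN)
fixed_result = indentation_error(FIXED)
```

Because the error is raised at compile time, it surfaces as soon as the module is imported, which matches the review's point that the engine fails before any TurboQuant setting is read.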

chatgpt-codex-connector Bot reviewed Jun 1, 2026

yagil closed this Jun 1, 2026

github-actions Bot locked and limited conversation to collaborators Jun 1, 2026

github-actions Bot added the "CLA signed" label Jun 1, 2026

ShrekDino wants to merge 1 commit into lmstudio-ai:main from ShrekDino:add-turboquant-kv-compression
