Add TurboQuant PolarQuant KV cache compression (turbo2/turbo3/turbo4)#332

Closed
ShrekDino wants to merge 1 commit into lmstudio-ai:main from ShrekDino:add-turboquant-kv-compression

@ShrekDino

Summary

Add TurboQuant+ PolarQuant KV cache compression support to the MLX engine backend. This implements near-lossless KV cache compression using PolarQuant (AISTATS 2026, arXiv:2502.02617) with Walsh-Hadamard rotation and Lloyd-Max optimal scalar quantization.

Changes

New: mlx_engine/utils/turboquant.py

  • Fast Walsh-Hadamard Transform implementation using MLX ops (O(d log d))
  • make_wh_rotation() / apply_wh_rotation() — WHT rotation with random sign flips
  • lloyd_max_centroids() — precomputed optimal centroids for N(0, 1/d) distribution
  • quantize_polar() / dequantize_polar() — PolarQuant compression/decompression
  • TurboQuantKVCache — drop-in compatible KVCache with inline PolarQuant compression
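The rotation step listed above can be sketched in pure Python to show the idea. This is only an illustration under stated assumptions: the PR's version uses MLX ops, and the function names and signatures here (`fwht`, `make_signs`, `rotate`, `unrotate`) are hypothetical, not those of `make_wh_rotation()` / `apply_wh_rotation()`.

```python
import math
import random

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two-length list."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(x)
    h = 1
    while h < n:
        # One butterfly level: combine entries that are h apart (O(n) work).
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # scaling makes the transform orthonormal
    return [v * scale for v in out]

def make_signs(d, seed=0):
    """Random +/-1 diagonal, fixed by a seed so it can be regenerated later."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(d)]

def rotate(x, signs):
    """y = H D x: flip signs, then apply the orthonormal WHT."""
    return fwht([s * v for s, v in zip(signs, x)])

def unrotate(y, signs):
    """x = D H y: H is symmetric and orthonormal, and D is its own inverse."""
    return [s * v for s, v in zip(signs, fwht(y))]
```

Because the rotation is orthogonal, it preserves vector norms and is undone exactly by `unrotate`; there are log2(d) butterfly levels of O(d) work each, which is where the O(d log d) cost comes from.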

Modified: mlx_engine/utils/kv_cache_quantization.py

  • get_kv_cache_quantization_params() now accepts string turbo modes
  • New VALID_TURBO_MODES = ("turbo2", "turbo3", "turbo4")
  • When kv_bits is a string like "turbo3", returns it as-is for the pipeline to handle
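One way to handle a `kv_bits` value that may be either an integer or a turbo string is sketched below. `parse_kv_bits` is a hypothetical helper for illustration, not the PR's `get_kv_cache_quantization_params()`; per the description above, the PR returns the string as-is, whereas this sketch splits it into a scheme and a bit width.

```python
VALID_TURBO_MODES = ("turbo2", "turbo3", "turbo4")

def parse_kv_bits(kv_bits):
    """Hypothetical helper: classify kv_bits as (scheme, bit_width)."""
    if kv_bits is None:
        return ("none", None)  # no KV cache quantization requested
    if isinstance(kv_bits, str):
        if kv_bits not in VALID_TURBO_MODES:
            raise ValueError(
                f"unknown kv_bits mode {kv_bits!r}; expected one of {VALID_TURBO_MODES}"
            )
        return ("turbo", int(kv_bits[len("turbo"):]))
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(kv_bits, bool) or not isinstance(kv_bits, int):
        raise TypeError(f"kv_bits must be an int or a turbo mode string, got {type(kv_bits).__name__}")
    return ("native", kv_bits)
```

Validating the string once at this boundary means later stages can branch on the scheme without re-checking the mode name.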

Modified: mlx_engine/cache_wrapper.py

  • Import turboquant module
  • In _prefill_cache(), when kv_bits is a turbo string, apply PolarQuant instead of MLX native quantization
  • Helper _apply_turboquant_to_cache() replaces cache entries with quantized versions

Compression Ratios

  • turbo2: 2-bit PolarQuant, ~6.4x compression
  • turbo3: 3-bit PolarQuant, ~4.6x compression
  • turbo4: 4-bit PolarQuant, ~3.8x compression
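The ratios above depend on how indices and norms are laid out in memory, which the PR description does not spell out. The arithmetic can be sketched as follows; the group sizes and norm widths in the usage note are assumptions chosen for illustration, not the PR's actual layout.

```python
def compression_ratio(bits, group_size, norm_bits, baseline_bits=16):
    """Baseline storage divided by quantized storage for one group of values.

    bits          -- bits stored per quantized value
    group_size    -- number of values that share one stored norm
    norm_bits     -- bits used to store that norm
    baseline_bits -- bits per value in the uncompressed cache (fp16 assumed)
    """
    quantized_bits = bits * group_size + norm_bits
    return (baseline_bits * group_size) / quantized_bits
```

For example, 3-bit indices with one 16-bit norm per 32 values give 512 / 112 ≈ 4.57, while one fp32 norm per 128 values gives 2048 / 416 ≈ 4.92. If each index occupied a full byte instead of being bit-packed, `compression_ratio(8, 128, 32)` would be about 1.94 regardless of the nominal bit width.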

References

Testing

Requires Apple Silicon hardware with MLX. Test with kv_bits="turbo3" in the model load config.

Implements TurboQuant+ KV cache compression for MLX:
- PolarQuant: norm extraction + WHT rotation + Lloyd-Max scalar quantization
- Fast Walsh-Hadamard Transform for O(d log d) rotation
- Precomputed Lloyd-Max centroids for N(0, 1/d) distribution
- TurboQuantKVCache class compatible with MLX's KVCache interface
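The pipeline in the list above (norm extraction, rotation, Lloyd-Max scalar quantization) can be sketched end to end in pure Python. This is a minimal illustration under assumptions, not the PR's MLX code: the signatures are hypothetical, the centroids are computed by Lloyd iteration rather than read from a precomputed table, and indices are left unpacked.

```python
import math
import random

def fwht(x):
    """Orthonormal Walsh-Hadamard transform; len(x) must be a power of two."""
    out, h = list(x), 1
    while h < len(out):
        for start in range(0, len(out), 2 * h):
            for i in range(start, start + h):
                out[i], out[i + h] = out[i] + out[i + h], out[i] - out[i + h]
        h *= 2
    return [v / math.sqrt(len(out)) for v in out]

def make_signs(d, seed=0):
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(d)]

def _pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def lloyd_max_centroids(bits, sigma, iters=200):
    """Lloyd iteration for a 2**bits-level quantizer of N(0, sigma**2)."""
    k = 2 ** bits
    c = [-2.0 + 4.0 * (i + 0.5) / k for i in range(k)]  # initial guess, unit scale
    for _ in range(iters):
        # Decision boundaries are midpoints between neighbouring centroids.
        edges = [-math.inf] + [(a + b) / 2 for a, b in zip(c, c[1:])] + [math.inf]
        # Each centroid moves to the conditional mean of its cell.
        c = [(_pdf(lo) - _pdf(hi)) / (_cdf(hi) - _cdf(lo))
             for lo, hi in zip(edges, edges[1:])]
    return [sigma * v for v in c]

def quantize_polar(x, signs, centroids):
    """Return (norm, indices): the vector's norm plus one centroid index per coordinate."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        return 0.0, [0] * len(x)
    # Unit-normalise, then rotate so coordinates look roughly N(0, 1/d).
    rotated = fwht([s * v / norm for s, v in zip(signs, x)])
    idx = [min(range(len(centroids)), key=lambda j: abs(centroids[j] - r))
           for r in rotated]
    return norm, idx

def dequantize_polar(norm, idx, signs, centroids):
    """Look up centroids, undo the rotation, and restore the norm."""
    restored = fwht([centroids[j] for j in idx])
    return [norm * s * v for s, v in zip(signs, restored)]
```

For head dimension `d`, the centroids would be built once with `lloyd_max_centroids(bits, 1 / math.sqrt(d))` and shared across all vectors; only the norm and the indices are stored per vector.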

The cache stores quantized indices (uint8) + norms (fp32) instead of
full-precision keys/values, achieving:
- turbo2: 2-bit, ~6.4x compression
- turbo3: 3-bit, ~4.6x compression
- turbo4: 4-bit, ~3.8x compression

See github.com/TheTom/turboquant_plus for the reference implementation.

github-actions Bot commented Jun 1, 2026


Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by posting a pull request comment in the format below.


I have read the CLA Document and I hereby sign the CLA


ShrekDino does not appear to be a GitHub user. You need a GitHub account to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 30ece36b94

if live_cache_size is None:
self._live_tokens = None
self._live_cache = self._make_cache()
self._live_tokens = None

P1: Restore cancellation branch indentation

Importing cache_wrapper.py now fails with IndentationError: expected an indented block after 'if' statement on line 237 because the body of if live_cache_size is None: was dedented out of _prefill_cache. This blocks the engine from starting for any configuration, before TurboQuant is even selected.
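An import-time failure like this can be caught before review with a plain syntax check. The sketch below uses the standard library's `ast.parse`; the source strings are simplified stand-ins for the affected branch, not the actual contents of cache_wrapper.py, and `syntax_error_of` is a hypothetical helper name.

```python
import ast

def syntax_error_of(source, filename="<snippet>"):
    """Return the SyntaxError raised while parsing source, or None if it parses."""
    try:
        ast.parse(source, filename=filename)
    except SyntaxError as exc:  # IndentationError is a subclass of SyntaxError
        return exc
    return None

# Simplified stand-in: the body of the `if` sits at the same level as the `if`.
broken = (
    "def _prefill_cache(live_cache_size):\n"
    "    if live_cache_size is None:\n"
    "    live_tokens = None\n"
)

# Same code with the body indented one level under the `if`.
fixed = (
    "def _prefill_cache(live_cache_size):\n"
    "    if live_cache_size is None:\n"
    "        live_tokens = None\n"
)

for name, src in (("broken", broken), ("fixed", fixed)):
    err = syntax_error_of(src)
    print(name, "->", "ok" if err is None else f"{type(err).__name__} at line {err.lineno}")
```

Running the same check over each changed file (or `python -m py_compile <file>`) needs no Apple Silicon hardware, since it only parses the source and never imports MLX.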

yagil closed this Jun 1, 2026
github-actions Bot locked and limited conversation to collaborators Jun 1, 2026
github-actions Bot added the "CLA signed" label (Indicates that all contributors have signed) Jun 1, 2026