Curated subset of #90 + K-side norm accounting fix #91
Subset of @brosequist's #90 commit 0fd5de9 — keeping the actual fixes, deferring the streaming + serialization API surface until a production caller exists.

Included:
- `KVCacheCompressor.memory_stats()` was omitting the float32 norm stored per V vector, inflating the reported compression ratio. Adds `v_bits_total += n_vectors * 32`.
- `TurboQuantMSE.compressed_size_bits()` — was missing (`TurboQuant` already had it).
- Replaces the `seed + 1000` magic offset with `np.random.SeedSequence(seed).spawn(2)` for true PRNG independence between the PolarQuant and QJL stages, and between the K and V quantizers.

Deferred (not in this commit):
- `compress_token()` / `get_compressed_cache()` streaming API
- `CompressedVector.to_bytes()` / `from_bytes()` binary serialization
- `CompressedKVCache.save()` / `load()` npz serialization
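A minimal sketch of the spawn-based wiring (variable names here are illustrative, not the repo's actual ones):

```python
import numpy as np

seed = 42
root = np.random.SeedSequence(seed)
k_seq, v_seq = root.spawn(2)          # independent child sequences
k_rng = np.random.default_rng(k_seq)  # K-quantizer randomness
v_rng = np.random.default_rng(v_seq)  # V-quantizer randomness
# Unlike seed and seed + 1000, spawned children are designed to yield
# non-overlapping, uncorrelated streams.
```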
test_turboquant_improves_over_polarquant

The existing test ended with a print() and no assertion, silently allowing QJL to be worse than PolarQuant. This updates the test to assert the known finding: QJL (TurboQuant 2-bit) is actively worse than MSE-only PolarQuant at the same bit budget. The assertion will fire if QJL is ever fixed and starts winning, prompting re-evaluation of the production path. See turbo4-resurrection.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
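The shape of the guard, roughly (the `ip_distortion` helper and the constructor signatures here are placeholders, not the repo's actual API):

```python
def test_turboquant_improves_over_polarquant():
    # Known finding: QJL (TurboQuant 2-bit) loses to MSE-only PolarQuant
    # at the same bit budget. Asserting it means a future QJL fix trips
    # this test and forces a re-evaluation of the production path.
    polar_err = ip_distortion(PolarQuant(d=256, bits=2))
    qjl_err = ip_distortion(TurboQuant(d=256, bits=2))
    assert polar_err < qjl_err, "QJL now beats PolarQuant — re-evaluate"
```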
TestFastRotationExtended covers: round-trip invertibility (x → rotate → unrotate = x), batch vs. single-vector consistency, and energy distribution uniformity after rotation. None of these three properties was previously tested.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
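For reference, a minimal standalone version of the first two properties, using a random orthogonal matrix as a stand-in for the repo's fast rotation:

```python
import numpy as np

def test_rotation_round_trip():
    rng = np.random.default_rng(0)
    d = 64
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # stand-in rotation
    x = rng.standard_normal(d)
    np.testing.assert_allclose(q.T @ (q @ x), x, atol=1e-10)

def test_batch_matches_single():
    rng = np.random.default_rng(1)
    d, n = 64, 8
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    xs = rng.standard_normal((n, d))
    batch = xs @ q.T                                  # batched rotation
    singles = np.stack([q @ x for x in xs])
    np.testing.assert_allclose(batch, singles, atol=1e-10)
```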
Adds a [tool.ruff] section to pyproject.toml (line-length=120, E/W/F rules, ignoring E501/E741) and a GitHub Actions workflow (.github/workflows/lint.yml) that runs ruff check on every push and pull request. Replaces ad-hoc style discussions with an enforced, zero-config lint gate.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
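The configured block plausibly looks like this — a sketch reconstructed from the description above, not a verbatim copy of the repo's pyproject.toml:

```toml
[tool.ruff]
line-length = 120
select = ["E", "W", "F"]   # pycodestyle errors/warnings + pyflakes
ignore = ["E501", "E741"]  # line-too-long, ambiguous variable names
```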
The lint workflow added in 46efe26 ran 'ruff check .' against the whole repo and failed immediately: the existing codebase has 233 pre-existing ruff violations (78 F401 unused imports, 68 I001 import sorting, 40 F541 empty f-strings, 32 F841 unused vars, etc.) across benchmarks/ and scripts/. A CI gate the legacy code doesn't pass is unhelpful, so remove .github/workflows/lint.yml. Keep the [tool.ruff] block in pyproject.toml as opt-in documentation: anyone running 'ruff check' locally still gets the configured rules, and the workflow can be re-enabled once the legacy violations are addressed (187 of the 233 are auto-fixable via 'ruff check --fix').
TurboQuant.CompressedVector stores TWO float32 norms per vector (vector_norms = ||x||_2 and residual_norms = ||residual||_2), but compressed_size_bits and KVCacheCompressor.memory_stats only counted one (32 bits instead of 64). Pre-existing on main, and parallel to the V-side undercount fixed in the previous commit: V uses TurboQuantMSE, which stores a single norm — 32 bits is correct there. K uses full TurboQuant, which stores two.

Effect: K compressed size was understated by 32 bits per vector, inflating the reported compression ratio. With d=128, b=3 the TurboQuant ratio drops from 4.92× to 4.57× (the true value), and the combined KV ratio at d=128, k=v=3 drops from ~2.46× to ~2.37×. No quantization-output changes — accounting only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
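A sketch of the corrected accounting and the ratio arithmetic (parameter names are illustrative; the fp16 baseline is inferred from the reported ratios):

```python
def compressed_size_bits(n_vectors: int, d: int, bits: int, n_norms: int) -> int:
    # n_norms = 1 for TurboQuantMSE (vector norm only),
    # n_norms = 2 for full TurboQuant (vector norm + residual norm).
    return n_vectors * (d * bits + n_norms * 32)

# Per vector at d=128, b=3 against an fp16 baseline (128 * 16 = 2048 bits):
#   counting one norm:  2048 / (384 + 32) ≈ 4.92x  (old, understated size)
#   counting two norms: 2048 / (384 + 64) ≈ 4.57x  (true ratio)
```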
Selectively merging from @brosequist's #90 — keeping the actual fixes, deferring API surface, plus an additional K-side fix that #90 missed.
Thanks @brosequist for the bundle — review surface was much easier this way. Re-bundling further per the curation below.
What's included
Commits:
- `b75813b` — `memory_stats`, SeedSequence PRNG, `TurboQuantMSE.compressed_size_bits`
- `1074625` — `test_turboquant_improves_over_polarquant`
- `f23570e`, `3e37572`, `0ca5bcc`, `8afc4bf`

Brett's first commit was split — credit preserved.
The original `0fd5de9` bundle (5 changes) was cherry-picked with the streaming + serialization API deferred.

What's kept:
- `KVCacheCompressor.memory_stats()`
- `TurboQuantMSE.compressed_size_bits()` (was missing; `TurboQuant` already had it)
- `SeedSequence.spawn(2)` replacing the `seed + 1000` magic offset

What's deferred (no caller yet; want to design for the production integration):
- `KVCacheCompressor.compress_token()` / `get_compressed_cache()` streaming API
- `CompressedVector.to_bytes()` / `from_bytes()` binary serialization
- `CompressedKVCache.save()` / `load()` npz serialization

The split commit retains @brosequist as author.
What's added
`8afc4bf` is a parallel fix to the V-norm fix in `b75813b`. `TurboQuant.CompressedVector` stores two float32 norms (`vector_norms = ||x||_2` and `residual_norms = ||residual||_2`), but `TurboQuant.compressed_size_bits` and `KVCacheCompressor.memory_stats` only counted one. V uses `TurboQuantMSE` (single norm — 32 bits is correct). K uses full `TurboQuant` (two norms — 64 bits is correct).

Numerical effect (verified live):
- `TurboQuant(d=128, b=3).compression_ratio()`: 4.92× → 4.57× (true)
- `KVCacheCompressor(d=128, k=3, v=3).memory_stats(...)['compression_ratio']`: ~2.46× → ~2.37× (true)

No quantization-output changes — accounting only.
What's not included from #90
- `feat: add calibrate() to OutlierTurboQuant` — `OutlierTurboQuant` is a deprecated path per docs/turboquant-plus-experiments.md ("Outlier channeling doesn't work… kurtosis stays 8-50… WHT rotation gets it to 2.9"). The calibrate code is well-written, just on a dead module.
- `docs: HIP/AMD NaN warning` — the root-cause story ("large K norms → NaN") contradicts docs/papers/asymmetric-kv-compression.md:218, which finds extreme K norms compress better (more Gaussian after normalization). The real cause is HIP-kernel-specific. Will revisit after kernel triage.

Consistency with docs/papers/why-mse-fails-for-kv-quantization.md

The new MSE paper argues MSE is a broken proxy for K-cache quantization in deployment because attention is non-linear and sparse. Brett's QJL regression-guard test (1074625) measures inner-product distortion on synthetic Gaussian pairs (d=256) — the linear-operator regime where the new paper explicitly says IP/MSE does proxy quality (alongside RaBitQ-style top-k IP search). The test is a regression guard for IP distortion only; the production decision to drop QJL is justified separately by docs/papers/turbo4-resurrection.md's PPL ablation. No conflict.
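For reference, a minimal sketch of an inner-product distortion probe in that regime — synthetic Gaussian pairs at d=256, with `quantize` standing in for either compressor's encode/decode round trip (the repo's actual test harness may differ):

```python
import numpy as np

def ip_distortion(quantize, d=256, n=1024, seed=0):
    rng = np.random.default_rng(seed)
    queries = rng.standard_normal((n, d))
    keys = rng.standard_normal((n, d))
    keys_hat = quantize(keys)                      # decode(encode(keys))
    exact = np.einsum("nd,nd->n", queries, keys)
    approx = np.einsum("nd,nd->n", queries, keys_hat)
    return float(np.mean(np.abs(exact - approx)))
```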
Test plan

- `pytest tests/ refract/tests/` — 982 passed, 1 skipped (7 fewer than the #90 baseline because the streaming/serialization tests were deferred with their code)
- Verified `TurboQuant.compression_ratio()` and `KVCacheCompressor.memory_stats()` numerics against expected values

🤖 Generated with Claude Code