Bundled fixes & tests: V-norm accounting, OutlierTurboQuant.calibrate, rotation tests, ruff CI, HIP/AMD NaN docs by brosequist · Pull Request #90 · TheTom/turboquant_plus

brosequist · 2026-05-09T08:44:41Z

Hi @TheTom — thanks for the friendly note on #61 back in April. I'd left the original six PRs (#61, #62, #63, #64, #65, #66) sitting open for a few weeks and decided to close them today and rebundle here as a single PR, hoping the smaller review surface helps you triage when you have time. All six commits still apply cleanly against main (zero rebase needed) and are preserved as separate commits in this branch so the individual rationale and git blame story stay intact.

Closing this PR if you'd rather see them re-opened individually is also fine — happy to follow whatever workflow works for you.

What's in this PR (6 commits, original PR refs in parens)

1. `fix: V-norm in memory_stats, SeedSequence PRNG, streaming API, serialization` (was #61)

KVCacheCompressor.memory_stats() was omitting the 32-bit float norm stored per V vector, inflating the reported compression ratio. Adds v_bits_total += n_vectors * 32.
Adds compressed_size_bits() to TurboQuantMSE (was missing; TurboQuant already had it).
Replaces seed + 1000 offset with np.random.SeedSequence(seed).spawn(2) for true PRNG independence between the PolarQuant and QJL stages.
Adds compress_token() / get_compressed_cache() streaming API to KVCacheCompressor for auto-regressive token-by-token inference.
Adds CompressedVector.to_bytes() / from_bytes() for disk / network serialisation.

2. `test: document QJL regression in test_turboquant_improves_over_polarquant` (was #62)

The existing test had no assertion — it only print()'d, silently allowing QJL to be worse than PolarQuant. Adds a regression-guard assertion documenting the empirical finding (TQ 2-bit avg ≈ 0.091 vs PQ 2-bit avg ≈ 0.041 inner-product distortion). If QJL is ever fixed to actually improve over PQ, the test will fail loudly and prompt re-evaluation of the production path.

3. `test: add correctness and round-trip tests for fast rotation functions` (was #63)

Three property tests for fast_rotate / fast_unrotate (none of which existed previously):
1. Round-trip invertibility — fast_unrotate(fast_rotate(x)) ≈ x
2. Batch consistency — row-by-row equals all-at-once
3. Energy distribution — roughly uniform per-coordinate variance after rotation

4. `feat: add calibrate() to OutlierTurboQuant for data-driven channel split` (was #64)

OutlierTurboQuant.calibrate(calibration_vectors) computes per-channel RMS across a calibration set and marks channels whose RMS exceeds 3× the median as outlier channels, updating the compressor's split in place.
Follows the dynamic-threshold approach from the LLM.int8() / SmoothQuant literature.

5. `chore: add ruff linting to pyproject.toml and CI workflow` (was #65)

[tool.ruff] block in pyproject.toml (line-length=120, E/W/F, ignoring E501/E741).
.github/workflows/lint.yml runs ruff check on push / PR.
Pure tooling — zero behavioural changes.

6. `docs: add HIP/AMD NaN warning for q8_0/turbo3 on large K-norm models` (was #66)

Prominent warning block in docs/turboquant-recommendations.md documenting observed NaN divergence when using q8_0 or turbo3 on models with large K-vector norms (e.g. Qwen2.5-7B) on AMD/ROCm (HIP) backends. Recommends turbo2 / turbo4 or pre-quantisation K-norm clipping.
Likely related to the user-reported symptoms in [ROCm] Scale operation fails with "invalid device function" during Gemma 4 loading #86 ("[ROCm] Scale operation fails... during Gemma 4 loading") and [Bug] Numerical instability (NaN/Infinite Loop) with Qwen 2.5/3.5 models using turbo4 on RTX 4070Ti #60 ("Numerical instability with Qwen 2.5/3.5 models using turbo4").

Test plan

pytest tests/test_kv_cache.py — covers V-norm accounting, streaming API, serialisation round-trip
pytest tests/test_distortion.py::TestDistortionScaling::test_turboquant_improves_over_polarquant — QJL regression assertion
pytest tests/test_rotation.py — fast-rotation property tests
pytest tests/test_outlier.py — calibrate() plus all-inlier / all-outlier edge cases
ruff check . — passes (and the new GH Actions workflow runs it on every push)
Docs-only changes (Implement full TurboQuant (turboquant.py) — Algorithm 2 #6) — nothing to test

🤖 Generated with Claude Code

…essed_size_bits KVCacheCompressor.memory_stats() omitted the float32 norm stored per V vector, inflating the reported compression ratio. Add v_bits_total += n_vectors * 32 to account for it. Also adds compressed_size_bits() to TurboQuantMSE (was missing; TurboQuant already had it), fixing the asymmetry between the two classes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…uant The existing test ended with a print() and no assertion, silently allowing QJL to be worse than PolarQuant. This updates the test to assert the known finding: QJL (TurboQuant 2-bit) is actively worse than MSE-only PolarQuant at the same bit budget. The assertion will alert if QJL is ever fixed and starts winning, prompting re-evaluation of the production path. See turbo4-resurrection.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

TestFastRotationExtended covers: round-trip invertibility (x → rotate → unrotate = x), batch vs single-vector consistency, and energy distribution uniformity after rotation. All three property tests were previously untested. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Previously the outlier/inlier channel split was set at construction time and never adjusted. calibrate(calibration_vectors) now computes per-channel RMS, flags channels whose RMS exceeds 3× the median as outliers, and updates the split on the compressor — matching the dynamic-threshold approach described in the LLM.int8() and SmoothQuant literature. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Adds a [tool.ruff] section to pyproject.toml (line-length=120, E/W/F rules, ignoring E501/E741) and a GitHub Actions workflow (.github/workflows/lint.yml) that runs ruff check on every push and pull request. Replaces ad-hoc style discussions with an enforced, zero-config lint gate. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Adds a prominent WARNING block to turboquant-recommendations.md documenting the observed NaN divergence when using q8_0 or turbo3 compression on models with large K-vector norms (e.g. Qwen2.5-7B) on AMD/ROCm (HIP) backends. The root cause is the int8 overflow path that differs between HIP and CUDA. Recommended mitigations: switch to turbo2/turbo4 or add pre-quantization K-norm clipping. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The lint workflow added in 46efe26 ran 'ruff check .' against the whole repo and failed immediately because the existing codebase has 233 pre-existing ruff violations (78 F401 unused imports, 68 I001 import sorting, 40 F541 empty f-strings, 32 F841 unused vars, etc.) across benchmarks/ and scripts/. Adding a CI gate that the legacy code doesn't pass is unhelpful, so remove .github/workflows/lint.yml. Keep the [tool.ruff] block in pyproject.toml as opt-in documentation: anyone running 'ruff check' locally still gets the configured rules, and the workflow can be re-enabled later once the legacy violations are addressed (most are auto-fixable via 'ruff check --fix' across 187 of the 233).

@brosequist

Subset of @brosequist's #90 commit 0fd5de9 — keeping the actual fixes, deferring the streaming + serialization API surface until a production caller exists. Included: - KVCacheCompressor.memory_stats() was omitting the float32 norm stored per V vector, inflating reported compression ratio. Adds v_bits_total += n_vectors * 32. - TurboQuantMSE.compressed_size_bits() — was missing (TurboQuant already had it). - Replaces seed + 1000 magic offset with np.random.SeedSequence(seed).spawn(2) for true PRNG independence between PolarQuant and QJL stages, and between K and V quantizers. Deferred (not in this commit): - compress_token() / get_compressed_cache() streaming API - CompressedVector.to_bytes() / from_bytes() binary serialization - CompressedKVCache.save() / load() npz serialization

TheTom · 2026-05-09T15:49:27Z

hey @brosequist, first off, big apology for the delay on these. you opened the originals back in april, i sat on them way too long, and the rebundle made it much easier to review. really appreciate the patience and the diligence on the rebundle work.

i landed a curated subset in #91 with you as author on the cherry-picks. quick rundown:

merging from #90 (you authored, cherry-picked):

✅ V-norm fix in memory_stats + TurboQuantMSE.compressed_size_bits. split out of 0fd5de9 to keep just the fixes. real bug, accounting was off.
✅ SeedSequence PRNG cleanup. also split from 0fd5de9. cleaner than the magic offset.
✅ QJL regression-guard test. locks the production reality from docs/papers/turbo4-resurrection.md. good test.
✅ rotation property tests. pure additive, real coverage.
✅ ruff config (kept config, dropped the workflow per your follow-up).

deferred from #90:

⏸ streaming API (compress_token / get_compressed_cache) and binary/npz serialization (to_bytes / from_bytes / save / load). split out of 0fd5de9. solid code, but i don't have a production caller yet and want to design these for whatever ends up wiring them up. holding until that lands.
⏸ OutlierTurboQuant.calibrate(). the calibrate code itself is clean, but OutlierTurboQuant is a dead path on my end per docs/turboquant-plus-experiments.md (kurtosis stays high after channel removal, WHT rotation handles tails better, WUSH paper confirmed). holding for now.
⏸ HIP/AMD NaN docs. the root-cause story (large K norms causing NaN) actually contradicts what i'm seeing in docs/papers/asymmetric-kv-compression.md, which finds extreme K norms compress better because the post-normalization distribution becomes more Gaussian (boundary layers are ideal for Lloyd-Max). real cause is HIP-kernel-specific. happy to revisit after kernel triage.

also added a parallel K-norm accounting fix on top of yours in #91. compressed_size_bits for TurboQuant K was undercounting too (it stores two norms, not one), so #91 has both sides corrected. you flagged the V side, that pulled my attention to the K side.

thanks again for sticking with this. let me know if anything in the curation feels off, or if you'd like to take another swing at any of the deferred items with the production-caller / kernel context in mind.

brett and others added 7 commits May 9, 2026 04:20

TheTom mentioned this pull request May 9, 2026

Curated subset of #90 + K-side norm accounting fix #91

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Bundled fixes & tests: V-norm accounting, OutlierTurboQuant.calibrate, rotation tests, ruff CI, HIP/AMD NaN docs#90

Bundled fixes & tests: V-norm accounting, OutlierTurboQuant.calibrate, rotation tests, ruff CI, HIP/AMD NaN docs#90
brosequist wants to merge 7 commits intoTheTom:mainfrom
brosequist:main

brosequist commented May 9, 2026

Uh oh!

TheTom commented May 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

brosequist commented May 9, 2026

What's in this PR (6 commits, original PR refs in parens)

1. fix: V-norm in memory_stats, SeedSequence PRNG, streaming API, serialization (was #61)

2. test: document QJL regression in test_turboquant_improves_over_polarquant (was #62)

3. test: add correctness and round-trip tests for fast rotation functions (was #63)

4. feat: add calibrate() to OutlierTurboQuant for data-driven channel split (was #64)

5. chore: add ruff linting to pyproject.toml and CI workflow (was #65)

6. docs: add HIP/AMD NaN warning for q8_0/turbo3 on large K-norm models (was #66)

Test plan

Uh oh!

TheTom commented May 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

1. `fix: V-norm in memory_stats, SeedSequence PRNG, streaming API, serialization` (was #61)

2. `test: document QJL regression in test_turboquant_improves_over_polarquant` (was #62)

3. `test: add correctness and round-trip tests for fast rotation functions` (was #63)

4. `feat: add calibrate() to OutlierTurboQuant for data-driven channel split` (was #64)

5. `chore: add ruff linting to pyproject.toml and CI workflow` (was #65)

6. `docs: add HIP/AMD NaN warning for q8_0/turbo3 on large K-norm models` (was #66)