Summary
The feature/triattention-scoring branch wires up the --triattention <path> flag and ships a complete loader (src/triattention.c::tria_load, magic 0x54524941, versions 1/2/3 supported), but the repository does not include a calibration script that produces the stats.bin file the loader expects.
The README/TURBOQUANT.md reports results like "Qwen3.5-27B turbo3 + TriAtt 75%" — so a calibration pipeline clearly exists internally. Could it be published?
What's documented
src/triattention-integration.md describes the algorithm (CPU-side scoring every 128 tokens, GQA aggregation, z-norm, top-K). The header documents the on-disk schema (layer count, kv-head count, fc, head_dim, optional layer_budget_scales, per-head MRL and non-rot data). Sufficient to reverse-engineer the format, but not the calibration math.
Why it matters
Without the producer, third parties (including users on non-AMD hardware trying to validate the approach — sm_75 / sm_72 in our case) can build the binary but not exercise the TriAttention path end-to-end. We end up only able to compare TurboQuant variants.
Asks (in order of preference)
- Publish the script as-is (even if it's research-grade / undocumented).
- Document the math sufficient for a third-party reimplementation: how Q/K concentration centers are computed, what calibration corpus you used, MRL bin definitions, what
nonrot_alpha modulates.
- Confirm format details for a community reimplementation: byte order, per-head record layout, how
layer_budget_scales is computed, version differences (v1 vs v2 vs v3).
Context
Trying to evaluate feasibility of the TriAttention + TurboQuant path on NVIDIA Volta-class hardware (Jetson Xavier AGX, sm_72) for an LLM cluster. Happy to share results back if we get it working on Turing/Volta.
Thanks for the work — the turbo3 + TriAtt numbers are striking.
Summary
The
feature/triattention-scoringbranch wires up the--triattention <path>flag and ships a complete loader (src/triattention.c::tria_load, magic0x54524941, versions 1/2/3 supported), but the repository does not include a calibration script that produces thestats.binfile the loader expects.The README/TURBOQUANT.md reports results like "Qwen3.5-27B turbo3 + TriAtt 75%" — so a calibration pipeline clearly exists internally. Could it be published?
What's documented
src/triattention-integration.mddescribes the algorithm (CPU-side scoring every 128 tokens, GQA aggregation, z-norm, top-K). The header documents the on-disk schema (layer count, kv-head count, fc, head_dim, optional layer_budget_scales, per-head MRL and non-rot data). Sufficient to reverse-engineer the format, but not the calibration math.Why it matters
Without the producer, third parties (including users on non-AMD hardware trying to validate the approach — sm_75 / sm_72 in our case) can build the binary but not exercise the TriAttention path end-to-end. We end up only able to compare TurboQuant variants.
Asks (in order of preference)
nonrot_alphamodulates.layer_budget_scalesis computed, version differences (v1 vs v2 vs v3).Context
Trying to evaluate feasibility of the TriAttention + TurboQuant path on NVIDIA Volta-class hardware (Jetson Xavier AGX, sm_72) for an LLM cluster. Happy to share results back if we get it working on Turing/Volta.
Thanks for the work — the turbo3 + TriAtt numbers are striking.