Summary
turbo3 KV cache shows a clean +12% TG improvement over q8_0 in single-stream tests on Jetson Xavier AGX (sm_72, CUDA 11.4), but regresses -12% to -22% under --parallel >= 4 with longer prompts. The regression scales linearly with prompt length. q8_0 shows no equivalent regression.
Hardware / build context
- Jetson Xavier AGX (sm_72 Volta, 30 GB unified memory)
- CUDA 11.4.315, gcc 9.4
- Branch:
feature/triattention-scoring at f9a308d
- Build:
cmake -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=72 -DGGML_NATIVE=OFF -DCMAKE_BUILD_TYPE=Release — clean, all turbo fattn-vec instances compiled
Reproduction
Two binaries built from the same source. Same model (Qwen3.5-0.8B-Q4_K_M) for the small-model isolation, Qwen3.6-35B-A3B-UD-Q4_K_XL for the production-size validation.
Single-stream (llama-bench -p 1024 -n 128 -r 3)
| KV (K/V) |
TG t/s |
| q8_0 / q8_0 |
44.33 |
| turbo3 / turbo3 |
49.66 (+12%) |
turbo3 clearly wins — Sparse V skip path is active and helping.
Multi-slot (llama-server --parallel 8, single client, same prompt)
| KV (K/V) |
TG t/s |
| q8_0 / q8_0 |
44.62 |
| turbo3 / turbo3 |
39.18 (-12%) |
Same hardware, same model, identical request — only --parallel 8 and KV-type differ.
Production-scale (Qwen3.6-35B-A3B Q4_K_XL, --parallel 8, --mlock, c=524288)
| Prompt size |
KV |
TG t/s |
| 2 tokens |
turbo3 |
15.52 |
| 200 tokens |
turbo3 |
15.12 |
| 562 tokens |
turbo3 |
14.15 |
| 1.7K tokens |
turbo3 |
12.01 |
| (any) |
turbo2 [the legacy 8-bit alias upstream] |
~15 |
Linear degradation with prompt length under multi-slot. Single-stream the same model holds 15+ across prompt sizes.
Failed hypothesis (sharing so it isn't repeated)
I hypothesized the Sparse V skip in fattn-vec.cuh lines 349/395 was the cause — if (dominated) { continue; } triggers warp divergence, and the code already disables this on HIP/RDNA3 (commit ae9814cd5, "branch overhead exceeds dequant savings"). I patched the guard to also disable it on __CUDA_ARCH__ < 800:
#if !defined(GGML_USE_HIP) && (!defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 800)
Result was the opposite of what I expected. Disabling Sparse V on Volta caused single-stream turbo3 TG to drop from 49.66 → 35.06 (-29%, ±8.81), while q8_0 held at 44.11. So Sparse V is what gives turbo3 its single-stream win on Volta — turning it off forces full V dequant on every position. Reverted.
This means the multi-slot regression is not caused by Sparse V divergence cost. The fact that single-stream turbo3 runs the Sparse V path more aggressively (more skips at long context) yet still beats q8_0 single-stream suggests the divergence cost itself is fine on sm_72.
Remaining hypotheses (not tested — would need profiler access)
kv_unified=false per-slot KV layout interaction with turbo3's structured rotation per row. Multi-slot KV cells are in disjoint sub-buffers, possibly hurting cache locality more for turbo3's WHT-rotated reads than q8_0's simpler row layout.
- K-side attention sharpening (
α=1.0357 per the log) applied per-slot — wondering if it's redundant work in the multi-slot path.
- WHT rotation matrix in
turbo-wht.cu — maybe per-slot init not amortized, or shared-memory contention scales worse than q8_0 dequant.
I don't have nsight-compute on a Jetson + CUDA 11.4 toolchain, so I can't profile to pin which kernel is the actual hot path.
Asks
- Could you confirm whether you've tested
--parallel >= 4 with turbo3 on any hardware? README benchmarks (TURBOQUANT.md) appear to be single-stream RX 7900 XTX.
- If you have profiling data on Ampere with multi-slot, does
turbo3 show similar prompt-length-dependent TG degradation, or is this Volta-specific?
- Suggested investigation path? Happy to run any
nsys/ncu traces if you have a profiling-capable build to send.
Single-stream turbo3 works great on Volta — the underlying kernels are fine. The issue is specifically multi-slot serving with growing KV cache, which is the production deployment target. Reproduces 100% on this hardware.
Why this matters for Jetson deployment
Production prod uses --parallel 8 with c=524288 to serve multiple users. Switching q8_0 -> turbo3 would free ~6 GB of unified memory (5.12× compression vs f16), but the TG regression on long conversations cancels the upside. If we can fix the multi-slot path, this becomes a clean win for edge deployment on 30 GB unified-memory Jetson hardware.
Summary
turbo3KV cache shows a clean +12% TG improvement overq8_0in single-stream tests on Jetson Xavier AGX (sm_72, CUDA 11.4), but regresses -12% to -22% under--parallel >= 4with longer prompts. The regression scales linearly with prompt length.q8_0shows no equivalent regression.Hardware / build context
feature/triattention-scoringatf9a308dcmake -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=72 -DGGML_NATIVE=OFF -DCMAKE_BUILD_TYPE=Release— clean, all turbo fattn-vec instances compiledReproduction
Two binaries built from the same source. Same model (Qwen3.5-0.8B-Q4_K_M) for the small-model isolation, Qwen3.6-35B-A3B-UD-Q4_K_XL for the production-size validation.
Single-stream (
llama-bench -p 1024 -n 128 -r 3)turbo3clearly wins — Sparse V skip path is active and helping.Multi-slot (
llama-server --parallel 8, single client, same prompt)Same hardware, same model, identical request — only
--parallel 8and KV-type differ.Production-scale (Qwen3.6-35B-A3B Q4_K_XL, --parallel 8, --mlock, c=524288)
Linear degradation with prompt length under multi-slot. Single-stream the same model holds 15+ across prompt sizes.
Failed hypothesis (sharing so it isn't repeated)
I hypothesized the Sparse V skip in
fattn-vec.cuhlines 349/395 was the cause —if (dominated) { continue; }triggers warp divergence, and the code already disables this on HIP/RDNA3 (commitae9814cd5, "branch overhead exceeds dequant savings"). I patched the guard to also disable it on__CUDA_ARCH__ < 800:Result was the opposite of what I expected. Disabling Sparse V on Volta caused single-stream
turbo3TG to drop from 49.66 → 35.06 (-29%, ±8.81), while q8_0 held at 44.11. So Sparse V is what givesturbo3its single-stream win on Volta — turning it off forces full V dequant on every position. Reverted.This means the multi-slot regression is not caused by Sparse V divergence cost. The fact that single-stream
turbo3runs the Sparse V path more aggressively (more skips at long context) yet still beats q8_0 single-stream suggests the divergence cost itself is fine on sm_72.Remaining hypotheses (not tested — would need profiler access)
kv_unified=falseper-slot KV layout interaction with turbo3's structured rotation per row. Multi-slot KV cells are in disjoint sub-buffers, possibly hurting cache locality more forturbo3's WHT-rotated reads thanq8_0's simpler row layout.α=1.0357per the log) applied per-slot — wondering if it's redundant work in the multi-slot path.turbo-wht.cu— maybe per-slot init not amortized, or shared-memory contention scales worse than q8_0 dequant.I don't have nsight-compute on a Jetson + CUDA 11.4 toolchain, so I can't profile to pin which kernel is the actual hot path.
Asks
--parallel >= 4withturbo3on any hardware? README benchmarks (TURBOQUANT.md) appear to be single-stream RX 7900 XTX.turbo3show similar prompt-length-dependent TG degradation, or is this Volta-specific?nsys/ncutraces if you have a profiling-capable build to send.Single-stream
turbo3works great on Volta — the underlying kernels are fine. The issue is specifically multi-slot serving with growing KV cache, which is the production deployment target. Reproduces 100% on this hardware.Why this matters for Jetson deployment
Production prod uses
--parallel 8withc=524288to serve multiple users. Switchingq8_0->turbo3would free ~6 GB of unified memory (5.12× compression vs f16), but the TG regression on long conversations cancels the upside. If we can fix the multi-slot path, this becomes a clean win for edge deployment on 30 GB unified-memory Jetson hardware.