turbo3 KV multi-slot TG regression on NVIDIA Volta sm_72 — prompt-length dependent #11

@minhtcai

Summary

turbo3 KV cache shows a clean +12% TG improvement over q8_0 in single-stream tests on Jetson Xavier AGX (sm_72, CUDA 11.4), but regresses -12% to -22% under --parallel >= 4 with longer prompts. The regression scales linearly with prompt length. q8_0 shows no equivalent regression.

Hardware / build context

  • Jetson Xavier AGX (sm_72 Volta, 30 GB unified memory)
  • CUDA 11.4.315, gcc 9.4
  • Branch: feature/triattention-scoring at f9a308d
  • Build: cmake -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=72 -DGGML_NATIVE=OFF -DCMAKE_BUILD_TYPE=Release — clean, all turbo fattn-vec instances compiled

Reproduction

Two binaries built from the same source. Qwen3.5-0.8B-Q4_K_M is used for the small-model isolation and Qwen3.6-35B-A3B-UD-Q4_K_XL for the production-size validation.

Single-stream (llama-bench -p 1024 -n 128 -r 3)

| KV (K/V)        | TG (t/s)     |
|-----------------|--------------|
| q8_0 / q8_0     | 44.33        |
| turbo3 / turbo3 | 49.66 (+12%) |

turbo3 clearly wins — Sparse V skip path is active and helping.

Multi-slot (llama-server --parallel 8, single client, same prompt)

| KV (K/V)        | TG (t/s)     |
|-----------------|--------------|
| q8_0 / q8_0     | 44.62        |
| turbo3 / turbo3 | 39.18 (-12%) |

Same hardware, same model, identical request — only --parallel 8 and KV-type differ.
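The percentages in the two tables above can be reproduced with a one-line relative-change formula. The helper below is a minimal sketch. The function name `percent_change` is made up for illustration and is not part of llama-bench or llama-server.

```cpp
#include <cassert>

// Relative change of `measured` versus `baseline`, in percent.
// Positive means faster than the baseline, negative means a regression.
// Hypothetical helper for checking the reported numbers by hand.
double percent_change(double baseline, double measured) {
    assert(baseline > 0.0);  // a zero or negative baseline has no meaningful ratio
    return (measured - baseline) / baseline * 100.0;
}
```

With the reported figures, `percent_change(44.33, 49.66)` comes out at about +12.0 for single-stream and `percent_change(44.62, 39.18)` at about -12.2 for `--parallel 8`.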

Production-scale (Qwen3.6-35B-A3B Q4_K_XL, --parallel 8, --mlock, c=524288)

| Prompt size | KV     | TG (t/s) |
|-------------|--------|----------|
| 2 tokens    | turbo3 | 15.52    |
| 200 tokens  | turbo3 | 15.12    |
| 562 tokens  | turbo3 | 14.15    |
| 1.7K tokens | turbo3 | 12.01    |
| (any)       | q8_0   | ~15      |

TG degrades roughly linearly with prompt length under multi-slot. In single-stream mode, the same model holds 15+ t/s across prompt sizes.
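To put a number on the "linear" claim, the four turbo3 rows of the production-scale table can be run through an ordinary least-squares fit. This is a sketch only. It assumes "1.7K tokens" means exactly 1700 tokens, and the names `LinearFit`, `least_squares` and `fit_production_table` are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct LinearFit {
    double slope;      // change in TG (t/s) per prompt token
    double intercept;  // extrapolated TG (t/s) at zero prompt tokens
};

// Ordinary least-squares fit of y = slope * x + intercept.
LinearFit least_squares(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && x.size() >= 2);
    const double n = static_cast<double>(x.size());

    double mean_x = 0.0, mean_y = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        mean_x += x[i];
        mean_y += y[i];
    }
    mean_x /= n;
    mean_y /= n;

    double sxy = 0.0, sxx = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sxy += (x[i] - mean_x) * (y[i] - mean_y);
        sxx += (x[i] - mean_x) * (x[i] - mean_x);
    }
    assert(sxx > 0.0);  // all x identical would make the slope undefined

    const double slope = sxy / sxx;
    return LinearFit{slope, mean_y - slope * mean_x};
}

// turbo3 rows from the production-scale table (1.7K taken as 1700: an assumption).
LinearFit fit_production_table() {
    const std::vector<double> prompt_tokens = {2.0, 200.0, 562.0, 1700.0};
    const std::vector<double> tg_tokens_per_s = {15.52, 15.12, 14.15, 12.01};
    return least_squares(prompt_tokens, tg_tokens_per_s);
}
```

On these four points the fitted slope is about -0.0021 t/s per prompt token, so roughly 2 t/s lost per 1000 prompt tokens, with an intercept near 15.5 t/s. Four samples are too few to rule out a mildly non-linear curve, so this is a rough characterization only.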

Failed hypothesis (sharing so it isn't repeated)

I hypothesized the Sparse V skip in fattn-vec.cuh lines 349/395 was the cause — if (dominated) { continue; } triggers warp divergence, and the code already disables this on HIP/RDNA3 (commit ae9814cd5, "branch overhead exceeds dequant savings"). I patched the guard to also disable it on __CUDA_ARCH__ < 800:

#if !defined(GGML_USE_HIP) && (!defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 800)

Result was the opposite of what I expected. Disabling Sparse V on Volta caused single-stream turbo3 TG to drop from 49.66 → 35.06 (-29%, ±8.81), while q8_0 held at 44.11. So Sparse V is what gives turbo3 its single-stream win on Volta — turning it off forces full V dequant on every position. Reverted.

This means the multi-slot regression is not caused by Sparse V divergence cost. The fact that single-stream turbo3 runs the Sparse V path more aggressively (more skips at long context) yet still beats q8_0 single-stream suggests the divergence cost itself is fine on sm_72.
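For anyone retracing the experiment, the following host-compilable sketch shows how a guard of that shape gates a skip path. It is not the code from fattn-vec.cuh. The macro `SKETCH_SPARSE_V_SKIP`, the function `accumulate_v`, the stand-in `dequant_v` and the threshold test used for `dominated` are all assumptions made for illustration. In a host pass `__CUDA_ARCH__` is undefined, so the skip is enabled; in a device pass for sm_72 `__CUDA_ARCH__` is 720, so the patched guard compiles the skip out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Same condition as the patched guard quoted above; the macro name is hypothetical.
#if !defined(GGML_USE_HIP) && (!defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 800)
#define SKETCH_SPARSE_V_SKIP 1
#else
#define SKETCH_SPARSE_V_SKIP 0
#endif

struct AccumResult {
    double value;               // weighted sum over V
    std::size_t dequant_calls;  // how many positions paid for a V dequant
};

// Stand-in for the real per-position V dequantization (assumption: identity).
static double dequant_v(double packed) { return packed; }

// Accumulates sum_i weights[i] * V[i], optionally skipping negligible positions.
AccumResult accumulate_v(const std::vector<double>& weights,
                         const std::vector<double>& v_packed,
                         double skip_threshold) {
    assert(weights.size() == v_packed.size());
    (void)skip_threshold;  // unused when the skip path is compiled out
    AccumResult r{0.0, 0};
    for (std::size_t i = 0; i < weights.size(); ++i) {
#if SKETCH_SPARSE_V_SKIP
        // Hypothetical criterion: the attention weight is too small to matter.
        const bool dominated = weights[i] < skip_threshold;
        if (dominated) { continue; }
#endif
        r.value += weights[i] * dequant_v(v_packed[i]);
        ++r.dequant_calls;
    }
    return r;
}
```

With the skip compiled in, positions whose weight falls below the threshold never reach `dequant_v`. With it compiled out, every position is dequantized, which matches the "full V dequant on every position" behaviour seen after the patch.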

Remaining hypotheses (not tested — would need profiler access)

  1. kv_unified=false per-slot KV layout interaction with turbo3's structured rotation per row. Multi-slot KV cells are in disjoint sub-buffers, possibly hurting cache locality more for turbo3's WHT-rotated reads than q8_0's simpler row layout.
  2. K-side attention sharpening (α=1.0357 per the log) applied per-slot — wondering if it's redundant work in the multi-slot path.
  3. WHT rotation matrix in turbo-wht.cu — maybe per-slot init not amortized, or shared-memory contention scales worse than q8_0 dequant.
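For readers unfamiliar with the term in hypotheses 1 and 3, a Walsh-Hadamard transform (WHT) scaled by 1/sqrt(n) is an orthonormal rotation that can be applied in O(n log n) butterfly steps. The block below is the generic textbook in-place version, written for the host. It is not taken from turbo-wht.cu, and how that file actually implements or schedules the transform is not shown in this issue. The function name `fwht_normalized` is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// In-place fast Walsh-Hadamard transform, scaled by 1/sqrt(n) so that the
// transform is orthonormal and therefore its own inverse.
// Generic textbook algorithm; the length must be a power of two.
void fwht_normalized(std::vector<float>& v) {
    const std::size_t n = v.size();
    assert(n > 0 && (n & (n - 1)) == 0);

    // log2(n) butterfly stages, each touching every element once.
    for (std::size_t h = 1; h < n; h *= 2) {
        for (std::size_t i = 0; i < n; i += 2 * h) {
            for (std::size_t j = i; j < i + h; ++j) {
                const float a = v[j];
                const float b = v[j + h];
                v[j] = a + b;
                v[j + h] = a - b;
            }
        }
    }

    const float scale = 1.0f / std::sqrt(static_cast<float>(n));
    for (float& x : v) {
        x *= scale;
    }
}
```

Applying `fwht_normalized` twice returns the original vector, which is a quick sanity check for any reimplementation. Each stage reads elements `h` apart, so the access pattern is strided rather than a single linear pass. Whether that matters for the per-slot layout is exactly what the hypotheses above leave open.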

I don't have nsight-compute on a Jetson + CUDA 11.4 toolchain, so I can't profile to pin which kernel is the actual hot path.

Asks

  1. Could you confirm whether you've tested --parallel >= 4 with turbo3 on any hardware? README benchmarks (TURBOQUANT.md) appear to be single-stream RX 7900 XTX.
  2. If you have profiling data on Ampere with multi-slot, does turbo3 show similar prompt-length-dependent TG degradation, or is this Volta-specific?
  3. Suggested investigation path? Happy to run any nsys/ncu traces if you have a profiling-capable build to send.

Single-stream turbo3 works great on Volta — the underlying kernels are fine. The issue is specifically multi-slot serving with growing KV cache, which is the production deployment target. Reproduces 100% on this hardware.

Why this matters for Jetson deployment

Our production deployment uses --parallel 8 with c=524288 to serve multiple users. Switching q8_0 -> turbo3 would free ~6 GB of unified memory (5.12× compression vs f16), but the TG regression on long conversations cancels the upside. If we can fix the multi-slot path, this becomes a clean win for edge deployment on 30 GB unified-memory Jetson hardware.
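The memory figures can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes the ~6 GB estimate follows directly from bits per stored value. It uses 16 bits for f16 and 8.5 bits for q8_0 (ggml stores 32 int8 values plus one fp16 scale in 34 bytes). All function names are hypothetical.

```cpp
#include <cassert>

// Bits per stored value implied by a compression ratio relative to f16 (16 bits).
double bits_per_value_from_f16_ratio(double ratio_vs_f16) {
    assert(ratio_vs_f16 > 0.0);
    return 16.0 / ratio_vs_f16;
}

// Fraction of memory saved when moving from one bit width to a smaller one.
double fraction_saved(double from_bits, double to_bits) {
    assert(from_bits > 0.0 && to_bits > 0.0 && to_bits <= from_bits);
    return 1.0 - to_bits / from_bits;
}

// Size of the original cache implied by an absolute saving and a saved fraction.
double implied_original_size(double saved_amount, double saved_fraction) {
    assert(saved_fraction > 0.0 && saved_fraction <= 1.0);
    return saved_amount / saved_fraction;
}
```

With 5.12× versus f16, turbo3 works out to about 3.125 bits per value, so going from q8_0 to turbo3 saves roughly 63% of the KV cache. Under the stated assumption, a ~6 GB saving implies a q8_0 KV cache of about 9.5 GB shrinking to about 3.5 GB.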
