turbo3 KV multi-slot TG regression on NVIDIA Volta sm_72 — prompt-length dependent

## Summary

`turbo3` KV cache shows a clean **+12% TG** improvement over `q8_0` in single-stream tests on Jetson Xavier AGX (sm_72, CUDA 11.4), but regresses **-12% to -22%** under `--parallel >= 4` with longer prompts. The regression scales linearly with prompt length. `q8_0` shows no equivalent regression.

## Hardware / build context

- Jetson Xavier AGX (sm_72 Volta, 30 GB unified memory)
- CUDA 11.4.315, gcc 9.4
- Branch: `feature/triattention-scoring` at `f9a308d`
- Build: `cmake -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=72 -DGGML_NATIVE=OFF -DCMAKE_BUILD_TYPE=Release` — clean, all turbo fattn-vec instances compiled

## Reproduction

Two binaries built from the same source. Same model (Qwen3.5-0.8B-Q4_K_M) for the small-model isolation, Qwen3.6-35B-A3B-UD-Q4_K_XL for the production-size validation.

### Single-stream (`llama-bench -p 1024 -n 128 -r 3`)

| KV (K/V) | TG t/s |
|---|---|
| q8_0 / q8_0 | 44.33 |
| **turbo3 / turbo3** | **49.66 (+12%)** |

`turbo3` clearly wins — Sparse V skip path is active and helping.

### Multi-slot (`llama-server --parallel 8`, single client, same prompt)

| KV (K/V) | TG t/s |
|---|---|
| q8_0 / q8_0 | 44.62 |
| **turbo3 / turbo3** | **39.18 (-12%)** |

Same hardware, same model, identical request — only `--parallel 8` and KV-type differ.

### Production-scale (Qwen3.6-35B-A3B Q4_K_XL, --parallel 8, --mlock, c=524288)

| Prompt size | KV | TG t/s |
|---|---|---|
| 2 tokens | turbo3 | 15.52 |
| 200 tokens | turbo3 | 15.12 |
| 562 tokens | turbo3 | 14.15 |
| 1.7K tokens | turbo3 | 12.01 |
| (any) | turbo2 [the legacy 8-bit alias upstream] | ~15 |

Linear degradation with prompt length under multi-slot. Single-stream the same model holds 15+ across prompt sizes.

## Failed hypothesis (sharing so it isn't repeated)

I hypothesized the **Sparse V skip** in `fattn-vec.cuh` lines 349/395 was the cause — `if (dominated) { continue; }` triggers warp divergence, and the code already disables this on HIP/RDNA3 (commit `ae9814cd5`, "branch overhead exceeds dequant savings"). I patched the guard to also disable it on `__CUDA_ARCH__ < 800`:

```c
#if !defined(GGML_USE_HIP) && (!defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 800)
```

**Result was the opposite of what I expected.** Disabling Sparse V on Volta caused single-stream `turbo3` TG to drop from 49.66 → 35.06 (-29%, ±8.81), while q8_0 held at 44.11. So **Sparse V is what gives `turbo3` its single-stream win on Volta** — turning it off forces full V dequant on every position. Reverted.

This means the multi-slot regression is *not* caused by Sparse V divergence cost. The fact that single-stream `turbo3` runs the Sparse V path *more aggressively* (more skips at long context) yet still beats q8_0 single-stream suggests the divergence cost itself is fine on sm_72.

## Remaining hypotheses (not tested — would need profiler access)

1. **`kv_unified=false` per-slot KV layout interaction** with turbo3's structured rotation per row. Multi-slot KV cells are in disjoint sub-buffers, possibly hurting cache locality more for `turbo3`'s WHT-rotated reads than `q8_0`'s simpler row layout.
2. **K-side attention sharpening** (`α=1.0357` per the log) applied per-slot — wondering if it's redundant work in the multi-slot path.
3. **WHT rotation matrix** in `turbo-wht.cu` — maybe per-slot init not amortized, or shared-memory contention scales worse than q8_0 dequant.

I don't have nsight-compute on a Jetson + CUDA 11.4 toolchain, so I can't profile to pin which kernel is the actual hot path.

## Asks

1. Could you confirm whether you've tested `--parallel >= 4` with `turbo3` on any hardware? README benchmarks (`TURBOQUANT.md`) appear to be single-stream RX 7900 XTX.
2. If you have profiling data on Ampere with multi-slot, does `turbo3` show similar prompt-length-dependent TG degradation, or is this Volta-specific?
3. Suggested investigation path? Happy to run any `nsys`/`ncu` traces if you have a profiling-capable build to send.

Single-stream `turbo3` works great on Volta — the underlying kernels are fine. The issue is specifically multi-slot serving with growing KV cache, which is the production deployment target. Reproduces 100% on this hardware.

## Why this matters for Jetson deployment

Production prod uses `--parallel 8` with `c=524288` to serve multiple users. Switching `q8_0` -> `turbo3` would free ~6 GB of unified memory (5.12× compression vs f16), but the TG regression on long conversations cancels the upside. If we can fix the multi-slot path, this becomes a clean win for edge deployment on 30 GB unified-memory Jetson hardware.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

turbo3 KV multi-slot TG regression on NVIDIA Volta sm_72 — prompt-length dependent #11

Summary

Hardware / build context

Reproduction

Single-stream (`llama-bench -p 1024 -n 128 -r 3`)

Multi-slot (`llama-server --parallel 8`, single client, same prompt)

Production-scale (Qwen3.6-35B-A3B Q4_K_XL, --parallel 8, --mlock, c=524288)

Failed hypothesis (sharing so it isn't repeated)

Remaining hypotheses (not tested — would need profiler access)

Asks

Why this matters for Jetson deployment

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Prompt size	KV	TG t/s
2 tokens	turbo3	15.52
200 tokens	turbo3	15.12
562 tokens	turbo3	14.15
1.7K tokens	turbo3	12.01
(any)	turbo2 [the legacy 8-bit alias upstream]	~15

turbo3 KV multi-slot TG regression on NVIDIA Volta sm_72 — prompt-length dependent #11

Description

Summary

Hardware / build context

Reproduction

Single-stream (llama-bench -p 1024 -n 128 -r 3)

Multi-slot (llama-server --parallel 8, single client, same prompt)

Production-scale (Qwen3.6-35B-A3B Q4_K_XL, --parallel 8, --mlock, c=524288)

Failed hypothesis (sharing so it isn't repeated)

Remaining hypotheses (not tested — would need profiler access)

Asks

Why this matters for Jetson deployment

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

Single-stream (`llama-bench -p 1024 -n 128 -r 3`)

Multi-slot (`llama-server --parallel 8`, single client, same prompt)