KV cache quantization (q8_0) crashes with SEGFAULT on Jetson Nano CUDA, works on CPU

## Summary

KV cache quantization (`-ctk q8_0 -ctv q8_0`) works correctly on CPU (Raspberry Pi 4) but crashes with SEGFAULT on Jetson Nano CUDA (SM 5.3).

## What works

On Raspberry Pi 4 (CPU-only, PrismML fork unmodified):
- Bonsai-8B with `-ctk q8_0 -ctv q8_0 -c 4096`
- KV cache: 306 MiB (vs 576 MiB with FP16), **270 MiB saved**
- Speed: 0.6-0.8 tok/s (slight regression vs FP16)
- No crash, correct output
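The cache sizes above can be sanity-checked with a little arithmetic. The following is a minimal sketch, not code from either fork: it takes 36 layers and 4096 cells from the `llama_kv_cache` log line in this issue, uses the ggml `q8_0` block layout (32 values stored in 34 bytes), and assumes a per-layer K/V width of 1024 elements, which is inferred from the reported FP16 size rather than read from the model file. The function name `kv_cache_mib` is hypothetical.

```python
# q8_0 stores 32 values per block: one FP16 scale (2 bytes) + 32 int8 values.
Q8_0_BLOCK_ELEMS = 32
Q8_0_BLOCK_BYTES = 34
FP16_BYTES = 2
MIB = 1024 * 1024


def kv_cache_mib(n_layers, n_ctx, n_embd_kv, cache_type):
    """Return the combined K+V cache size in MiB for one cache type."""
    elems_per_tensor = n_layers * n_ctx * n_embd_kv
    if cache_type == "f16":
        tensor_bytes = elems_per_tensor * FP16_BYTES
    elif cache_type == "q8_0":
        tensor_bytes = elems_per_tensor // Q8_0_BLOCK_ELEMS * Q8_0_BLOCK_BYTES
    else:
        raise ValueError(f"unsupported cache type: {cache_type}")
    # K and V have the same shape here, so the total is twice one tensor.
    return 2 * tensor_bytes / MIB


# 36 layers and 4096 cells come from the log; 1024 is an inferred width (assumption).
fp16 = kv_cache_mib(36, 4096, 1024, "f16")
q8 = kv_cache_mib(36, 4096, 1024, "q8_0")
print(f"f16: {fp16:.2f} MiB, q8_0: {q8:.2f} MiB, saved: {fp16 - q8:.2f} MiB")
# prints "f16: 576.00 MiB, q8_0: 306.00 MiB, saved: 270.00 MiB"
```

Under the same assumptions, the helper can estimate other context sizes: `kv_cache_mib(36, 8192, 1024, "q8_0")` gives 612 MiB, against 1152 MiB for FP16 at 8192.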

## What crashes

On Jetson Nano (llamita.cpp, CUDA 10.2, SM 5.3):
- Bonsai-8B with `-ctk q8_0 -ctv q8_0 -c 4096`
- Model loads, KV cache allocates (306 MiB), and the compute buffer is reserved
- **SEGFAULT during warm-up** (first inference)

```
llama_kv_cache: CUDA0 KV buffer size = 306.00 MiB
llama_kv_cache: size = 306.00 MiB (4096 cells, 36 layers, 4/1 seqs), K (q8_0): 153.00 MiB, V (q8_0): 153.00 MiB
sched_reserve: CUDA0 compute buffer size = 304.23 MiB
...
Main process exited, code=dumped, status=11/SEGV
```

## Probable cause

The Q1_0 CUDA kernels (from the PrismML fork) likely do not handle quantized KV cache types in the attention KQ*V multiplication. The CUDA attention path may assume FP16 K and V values and crash when it encounters Q8_0 blocks.

Additionally, our CUDA 10.2 patches, which remove `if constexpr` guards (a C++17 feature that nvcc 10.2 does not support), may have broken the type-dispatch logic that protects against unsupported KV type combinations.

## Impact

Fixing this on the Jetson would save about 270 MiB of RAM, enabling:
- Context 8192+ (currently limited to 4096 with 980 MB free)
- More headroom for system stability

## Environment

- **Jetson Nano**: CUDA 10.2, SM 5.3, llamita.cpp (20+ patches)
- **Raspberry Pi 4**: CPU-only, PrismML fork unmodified, ARM NEON

## Reproduction

```bash
# Crashes on Jetson:
./llama-server -m bonsai-8b.gguf -ngl 99 -c 4096 -ctk q8_0 -ctv q8_0

# Works on RPi:
./llama-server -m bonsai-8b.gguf -c 4096 -ctk q8_0 -ctv q8_0
```
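To narrow down whether the K cache type, the V cache type, or only the combination triggers the crash, the reproduction can be run over every `f16`/`q8_0` pairing. The following is a rough triage sketch, not a tested harness: it reuses only the flags shown in the commands above, and the helper names (`build_commands`, `classify`, `run_once`) and the 120-second wait are assumptions chosen for illustration. A server that is still running when the wait expires is treated as having survived warm-up.

```python
import itertools
import signal
import subprocess

CACHE_TYPES = ("f16", "q8_0")


def build_commands(binary="./llama-server", model="bonsai-8b.gguf",
                   n_ctx=4096, ngl=99):
    """Yield (ctk, ctv, argv) for every K/V cache type combination."""
    for ctk, ctv in itertools.product(CACHE_TYPES, repeat=2):
        argv = [binary, "-m", model, "-ngl", str(ngl), "-c", str(n_ctx),
                "-ctk", ctk, "-ctv", ctv]
        yield ctk, ctv, argv


def classify(returncode):
    """Map a subprocess return code to a short verdict string."""
    if returncode is None:
        return "still running (survived the wait)"
    if returncode == -signal.SIGSEGV:
        # On POSIX, a negative return code means "killed by that signal".
        return "SEGFAULT"
    if returncode == 0:
        return "clean exit"
    return f"exit code {returncode}"


def run_once(argv, timeout_s=120):
    """Start the server, wait up to timeout_s (assumed value), report the outcome."""
    proc = subprocess.Popen(argv, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    try:
        proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.terminate()
        proc.wait()
        return classify(None)
    return classify(proc.returncode)
```

A possible way to use it on the Jetson is `for ctk, ctv, argv in build_commands(): print(ctk, ctv, run_once(argv))`. Passing `ngl=0` to `build_commands` would run the same matrix without GPU offload, which may help separate a CUDA kernel problem from a problem in the patched build itself.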
