Summary
KV cache quantization (-ctk q8_0 -ctv q8_0) works correctly on CPU (Raspberry Pi 4) but crashes with SEGFAULT on Jetson Nano CUDA (SM 5.3).
What works
On Raspberry Pi 4 (CPU-only, PrismML fork unmodified):
- Bonsai-8B with -ctk q8_0 -ctv q8_0 -c 4096
- KV cache: 306 MB (vs 576 MB with FP16) — 270 MB saved
- Speed: 0.6-0.8 tok/s (slight regression vs FP16)
- No crash, correct output
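The 306 MB and 576 MB figures can be cross-checked against the Q8_0 block layout. The sketch below is only a sanity check, not code from either fork. It takes the 4096 cells and 36 layers from the llama_kv_cache log line quoted further down, and assumes 1024 K (and V) values per cell per layer. That last number is inferred from the reported sizes, not read from the model file.

```python
# Q8_0 in ggml stores values in blocks of 32: one 2-byte FP16 scale + 32 int8 quants.
QK8_0 = 32
Q8_0_BLOCK_BYTES = 2 + QK8_0  # 34 bytes per 32 values

def kv_cache_mib(n_cells, n_layers, n_embd_kv, block_elems, block_bytes):
    """Total size in MiB of the K and V caches for a given storage type."""
    elems = n_cells * n_layers * n_embd_kv            # values in K (same count in V)
    per_tensor_bytes = elems // block_elems * block_bytes
    return 2 * per_tensor_bytes / (1024 * 1024)       # K + V

N_CELLS, N_LAYERS = 4096, 36   # from the llama_kv_cache log line
N_EMBD_KV = 1024               # ASSUMPTION: inferred from the reported sizes

fp16_mib = kv_cache_mib(N_CELLS, N_LAYERS, N_EMBD_KV, 1, 2)
q8_0_mib = kv_cache_mib(N_CELLS, N_LAYERS, N_EMBD_KV, QK8_0, Q8_0_BLOCK_BYTES)

print(f"FP16: {fp16_mib:.2f} MiB, Q8_0: {q8_0_mib:.2f} MiB, "
      f"saved: {fp16_mib - q8_0_mib:.2f} MiB")
```

Under that assumption the ratio is 34/64, which reproduces the reported sizes, so the allocation itself looks correct on both platforms.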
What crashes
On Jetson Nano (llamita.cpp, CUDA 10.2, SM 5.3):
- Bonsai-8B with -ctk q8_0 -ctv q8_0 -c 4096
- Model loads, KV cache allocates (306 MB), compute buffer reserves
- SEGFAULT during warm-up (first inference)
llama_kv_cache: CUDA0 KV buffer size = 306.00 MiB
llama_kv_cache: size = 306.00 MiB (4096 cells, 36 layers, 4/1 seqs), K (q8_0): 153.00 MiB, V (q8_0): 153.00 MiB
sched_reserve: CUDA0 compute buffer size = 304.23 MiB
...
Main process exited, code=dumped, status=11/SEGV
Probable cause
The Q1_0 CUDA kernels (from PrismML fork) likely don't handle quantized KV cache types in the attention KQ*V multiplication. The CUDA attention path may assume FP16 KV values and crash when encountering Q8_0 blocks.
Additionally, our CUDA 10.2 patches (removing if constexpr guards) may have broken type-dispatch logic that protects against unsupported KV type combinations.
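To make the first hypothesis concrete, here is a small Python sketch of the Q8_0 block layout as defined in ggml (an FP16 scale followed by 32 int8 values). It does not reproduce the crash and is not taken from llamita.cpp. It only shows that the same bytes mean something different when read as FP16, and that an FP16-strided read would expect 64 bytes per 32 values where only 34 are stored. The helper names are hypothetical.

```python
import struct

QK8_0 = 32  # values per Q8_0 block

def quantize_q8_0(values):
    """Pack 32 floats into one 34-byte block: FP16 scale + 32 int8 quants."""
    assert len(values) == QK8_0
    amax = max(abs(v) for v in values)
    d = amax / 127.0                          # per-block scale
    qs = [round(v / d) if d else 0 for v in values]
    return struct.pack("<e32b", d, *qs)

def dequantize_q8_0(block):
    """Recover the 32 approximate floats from one Q8_0 block."""
    d, *qs = struct.unpack("<e32b", block)
    return [d * q for q in qs]

values = [i / 32.0 for i in range(QK8_0)]
block = quantize_q8_0(values)

correct = dequantize_q8_0(block)              # 32 values, close to the input
misread = struct.unpack("<17e", block)        # same bytes treated as raw FP16

print(len(block), "bytes per block;", QK8_0 * 2, "bytes expected for FP16")
print(len(correct), "values when decoded as Q8_0;", len(misread), "when read as FP16")
```

If the CUDA attention path indexes the cache with FP16 strides, it would read well past the end of the 306 MiB buffer, which would be consistent with a SEGV on the first inference. This remains a hypothesis until confirmed with a backtrace.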
Impact
Fixing this on the Jetson would save 270 MB of RAM, enabling:
- Context 8192+ (currently limited to 4096 with 980 MB free)
- More headroom for system stability
Environment
- Jetson Nano: CUDA 10.2, SM 5.3, llamita.cpp (20+ patches)
- Raspberry Pi 4: CPU-only, PrismML fork unmodified, ARM NEON
Reproduction
# Crashes on Jetson:
./llama-server -m bonsai-8b.gguf -ngl 99 -c 4096 -ctk q8_0 -ctv q8_0
# Works on RPi:
./llama-server -m bonsai-8b.gguf -c 4096 -ctk q8_0 -ctv q8_0