
KV cache quantization (q8_0) crashes with SEGFAULT on Jetson Nano CUDA, works on CPU #2

@coverblew

Summary

KV cache quantization (-ctk q8_0 -ctv q8_0) works correctly on CPU (Raspberry Pi 4) but crashes with a SEGFAULT on Jetson Nano CUDA (SM 5.3).

What works

On Raspberry Pi 4 (CPU-only, PrismML fork unmodified):

  • Bonsai-8B with -ctk q8_0 -ctv q8_0 -c 4096
  • KV cache: 306 MiB (vs 576 MiB with FP16), saving 270 MiB
  • Speed: 0.6-0.8 tok/s (slight regression vs FP16)
  • No crash, correct output

What crashes

On Jetson Nano (llamita.cpp, CUDA 10.2, SM 5.3):

  • Bonsai-8B with -ctk q8_0 -ctv q8_0 -c 4096
  • Model loads, the KV cache allocates (306 MiB), and the compute buffer is reserved
  • SEGFAULT during warm-up (first inference)

llama_kv_cache: CUDA0 KV buffer size = 306.00 MiB
llama_kv_cache: size = 306.00 MiB (4096 cells, 36 layers, 4/1 seqs), K (q8_0): 153.00 MiB, V (q8_0): 153.00 MiB
sched_reserve: CUDA0 compute buffer size = 304.23 MiB
...
Main process exited, code=dumped, status=11/SEGV

Probable cause

The Q1_0 CUDA kernels (from the PrismML fork) likely don't handle quantized KV cache types in the attention KQ*V multiplication. The CUDA attention path may assume FP16 K/V values and crash when it encounters Q8_0 blocks.

Additionally, our CUDA 10.2 patches (which remove if constexpr guards, since nvcc in CUDA 10.2 does not support C++17) may have broken the type-dispatch logic that protects against unsupported KV type combinations.

Impact

Fixing this on the Jetson would save 270 MiB of memory (the Nano's GPU shares system RAM), enabling:

  • Contexts of 8192 or more (currently limited to 4096, with 980 MB free)
  • More headroom for system stability

Environment

  • Jetson Nano: CUDA 10.2, SM 5.3, llamita.cpp (20+ patches)
  • Raspberry Pi 4: CPU-only, PrismML fork unmodified, ARM NEON

Reproduction

# Crashes on Jetson:
./llama-server -m bonsai-8b.gguf -ngl 99 -c 4096 -ctk q8_0 -ctv q8_0

# Works on RPi:
./llama-server -m bonsai-8b.gguf -c 4096 -ctk q8_0 -ctv q8_0
