Idea
Running Bonsai-8B on Jetson Nano in CPU-only mode (`-ngl 0`) instead of GPU could significantly reduce memory usage, similar to what we see on Raspberry Pi 4.
Expected benefits
| | GPU mode (current) | CPU-only (proposed) |
|---|---|---|
| RAM used (est.) | 2500 MB | ~1400-1500 MB |
| RAM free (est.) | 980 MB | ~2400 MB |
| Speed | 1.1 tok/s | ~0.4-0.5 tok/s (A57 < A72) |
| KV Q8_0 | SEGFAULT (#2) | Should work |
| Max context | 4096 (tight) | 8K+ possible |
Why it matters
- At least 1 GB more free RAM for system stability
- KV cache Q8_0 should work (the segfault has only been seen with the CUDA kernels)
- Context could be doubled or more
- Trade-off: roughly 2-3x slower (1.1 tok/s down to an estimated 0.4-0.5 tok/s)
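
To sanity-check the "context could be doubled" claim, here is a rough KV cache size estimate. It is only a sketch: the layer count, KV head count, and head dimension are assumed values for a typical 8B model with grouped-query attention, not numbers taken from the Bonsai-8B model card, so replace them with the real ones from the GGUF metadata. The Q8_0 size follows from its block layout (32 int8 values plus one fp16 scale, 34 bytes per 32 elements).

```python
# Bytes per element for the KV cache types compared in this issue.
# Q8_0 stores blocks of 32 int8 values plus one fp16 scale: 34 bytes per 32 elements.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}

# ASSUMED architecture values (hypothetical, not read from the Bonsai-8B model).
N_LAYERS = 36
N_KV_HEADS = 8
HEAD_DIM = 128


def kv_cache_bytes(n_ctx, cache_type,
                   n_layers=N_LAYERS, n_kv_heads=N_KV_HEADS, head_dim=HEAD_DIM):
    """Estimate KV cache size in bytes for a given context length and cache type."""
    # Factor 2 covers the separate K and V tensors in every layer.
    elements = 2 * n_layers * n_ctx * n_kv_heads * head_dim
    return int(elements * BYTES_PER_ELEMENT[cache_type])


if __name__ == "__main__":
    for n_ctx in (4096, 8192):
        for cache_type in ("f16", "q8_0"):
            mib = kv_cache_bytes(n_ctx, cache_type) / 2**20
            print(f"ctx={n_ctx:5d}  {cache_type:5s}  {mib:7.1f} MiB")
```

Under these assumed dimensions, a Q8_0 cache at 8192 context (612 MiB) costs about the same as an F16 cache at 4096 (576 MiB), which is why getting Q8_0 to work matters as much as the freed RAM.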
Questions to investigate
- Does the PrismML fork compile CPU-only on Jetson Nano (Ubuntu 18.04, GCC 8)?
  - GCC 8 supports most C++17 but may need `-lstdc++fs`
  - The NEON patch from llamita.cpp may still be needed for GCC 8
  - Need to verify which patches are CUDA-specific vs GCC 8-specific
- Actual memory usage vs GPU mode
- Actual speed on Cortex-A57 (slower than A72 on RPi)
- Does KV Q8_0 work in CPU mode on Jetson?
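
For the memory question, one possible way to get comparable numbers is to read `/proc/meminfo` before and after the model is loaded, once per mode. On the Jetson Nano the GPU shares system RAM, so the same counter should cover both cases. The helper names below are hypothetical and this is a minimal sketch, not a tested benchmark harness.

```python
import re


def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of field name -> size in MB."""
    fields = {}
    for line in text.splitlines():
        # Lines look like "MemAvailable:    1003520 kB"; entries without kB are skipped.
        match = re.match(r"(\w+):\s+(\d+)\s*kB", line)
        if match:
            fields[match.group(1)] = int(match.group(2)) / 1024
    return fields


def read_mem_mb(path="/proc/meminfo"):
    """Return total and available memory in MB from the running system."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    return {"total": info["MemTotal"], "available": info["MemAvailable"]}


def used_by_model_mb(baseline, loaded):
    """Memory attributed to the model: drop in available MB after loading it."""
    return baseline["available"] - loaded["available"]
```

Call `read_mem_mb()` once on an idle system and again while the model is loaded and generating, then pass both results to `used_by_model_mb()`. Repeating this with and without `-ngl 0` gives the two columns of the table above as measured values instead of estimates.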
Related