genie-ai-runtime

LLM inference engine for Jetson Orin Nano Super 8 GB. Built for one job: serve a single small model fast and predictably in the memory budget left over after voice STT, TTS, denoise, and a Home Assistant container.

v1.0.0 — first stable release. See ROADMAP.md for how we got here (the alpha-track path-by-path narrative) and what comes next.

$ ./build/jetson-llm -m models/Qwen3-4B-Q4_K_M.gguf -p "Hello"
[engine] Model loaded in 1325 ms (1797 MB/s)
Model: Qwen3 4B Instruct Awq (36 layers, 32 heads, 8 KV heads, 2560 dim)
Hello! How can I assist you today?
[engine] Prefill: 13 tokens in 453 ms (28.7 tok/s)
[engine] Decode:   10 tokens in 916 ms (10.9 tok/s)
[engine] TTFT:    463 ms

At a glance

Target HW: Jetson Orin Nano Super 8 GB (SM 8.7, 102 GB/s, 67 TOPS GPU). Not portable.
Models: GGUF only. Validated on Qwen3-4B-Q4_K_M; the architecture path supports any Qwen-family / Llama-3-family model the loader can parse.
Kernels: custom CUDA for SM 8.7 — INT4 dequant-fused GEMV, tensor-core MMQ Q4_K prefill, flash attention, fused RMSNorm + RoPE + SwiGLU.
Memory model: pre-allocated KV and scratch pools accounted before any inference starts; OOM-guard prevents crashes.
KV cache: INT8 by default (alpha.12, FP16-ULP-bounded drift); FP16 opt-in. Persistent across turns (Path F) when a conversation_id or nvext.agent_hints.session_id is given, with per-response KV reuse counters.
Two binaries:
- jetson-llm — single-prompt / interactive CLI (default build).
- jetson-llm-server — OpenAI-compatible HTTP server (opt-in: -DJLLM_BUILD_SERVER=ON).

Why this exists

Existing runtimes aren't shaped for 8 GB unified memory that's already sharing real-time voice traffic with the LLM:

llama.cpp — portable, generic CUDA kernels, no Jetson memory awareness. The runtime genie-claw uses today; the one this project replaces.
TensorRT-LLM — fast but datacenter-shaped (A100 / H100); too heavy for Orin Nano's iGPU budget.
genie-ai-runtime — Orin-tuned CUDA kernels, pre-allocated pools, power-aware, single binary, single GGUF, single shared-memory budget that fits alongside whisper-server and genie-core.

Performance (Qwen3-4B Q4_K_M, 25 W MAXN SUPER)

Headline numbers on Jetson Orin Nano Super 8 GB at v1.0.0:

Workload	Number
Prefill (33-tok cold)	38.0 tok/s
Decode	9.9 tok/s
Cold TTFT (33-tok prompt)	877 ms
Warm-turn TTFT (Path F hydrated, 67% prefix)	444 ms
KV pool memory (1024 ctx)	74 MB (INT8 default)
Model load — cold (NVMe-bound)	30 s
Model load — warm (pagecache hit)	1.3 s

vs llama-bench pp18 = 17.97 ± 0.65 tok/s: +115 % prefill on the same hardware + model. See ROADMAP.md for the alpha-track perf evolution (alpha.2 → v1.0) and the path-by-path breakdown of how each gain landed.

Quickstart

Build

Prereqs: JetPack 6 (L4T R36.x, CUDA 12.6), CMake ≥ 3.20, gcc/g++ for C++17, git. See GeniePod/genie-os#1 for the full install list for a fresh Jetson.

git clone https://github.com/GeniePod/genie-ai-runtime.git
cd genie-ai-runtime
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

Outputs:

build/jetson-llm — the CLI
libjetson_llm_core.a — the engine, link this from genie-claw

The HTTP server is opt-in (genie-claw embeds the engine library directly). Add -DJLLM_BUILD_SERVER=ON to also produce build/jetson-llm-server.

CLI — single prompt

./build/jetson-llm -m /path/to/model.gguf -p "Hello"

CLI — interactive chat

./build/jetson-llm -m /path/to/model.gguf -i --chat

quit / exit / Ctrl-C leaves the loop. --think allows Qwen3 reasoning output; omit for the no-think default.

Server — HTTP, OpenAI-compatible

cmake -B build -DCMAKE_BUILD_TYPE=Release -DJLLM_BUILD_SERVER=ON
cmake --build build -j$(nproc)

./build/jetson-llm-server -m /path/to/model.gguf -p 8080

OpenAI-shape endpoints, SSE streaming, Qwen3 reasoning split into reasoning_content. Full surface in docs/server.md. For systemd deployment: sudo ./scripts/setup.sh && sudo systemctl enable --now jetson-llm-server.

Project layout

Module	Header	Responsibility
`src/memory/`	`include/jllm_memory.h`	`MemoryBudget`, `OOMGuard`, `KVCachePool`, `ScratchPool` — every byte accounted before inference starts
`src/jetson/`	`include/jllm_jetson.h`	`PowerState` (nvpmodel 7–25 W), `ThermalState`, `JetsonInfo`, `LiveStats`
`src/kernels/`	`include/jllm_kernels.h`	Orin SM 8.7 CUDA — INT4 dequant-fused GEMV, tensor-core MMQ Q4_K prefill, flash attention, fused norm + RoPE + SwiGLU, FP16↔INT8 KV convert
`src/engine/`	`include/jllm_engine.h`	GGUF load, transformer forward pass, tokenizer, sampling
`src/persistence/`	`src/persistence/kv_cache_file.h`	Persistent KV cache (Path F) — atomic on-disk format, LRU eviction, model fingerprint
`src/server/`	—	Optional cpp-httplib server. OpenAI shape, SSE, reasoning split

Master header: include/jllm.h.

Documentation

Doc	Purpose
`ROADMAP.md`	The alpha-track narrative (every Path, every perf win, every PR). What we did and how.
`CHANGELOG.md`	Per-release notes alpha.0 → v1.0.0
`docs/build.md`	Detailed build + prereqs
`docs/server.md`	HTTP server reference (endpoints, request fields, SSE, systemd)
`docs/architecture.md`	Module-level design notes
`docs/kernels.md`	CUDA kernel reference
`docs/memory.md`	KV / scratch / OOM-guard design
`docs/performance.md`	Detailed perf scaling tables (alpha-track)
`docs/jetson-hal.md`	Power / thermal / nvpmodel handling
`docs/qwen3-vs-our-runtime.md`	Architecture notes specific to Qwen3
`docs/testing.md`	Test layout + run instructions
`docs/validation-week1.md`	Week 1 bring-up notes (historical)

Integration with genie-claw

genie-claw (the local home AI assistant this runtime is built for) links libjetson_llm_core.a directly and calls engine.generate() in-process. The HTTP server is only for non-embedded consumers and for A/B testing against llama-server.

genie-claw currently runs on llama-server; the flip to genie-ai-runtime is gated on the 24 h soak (issue #7).

Tests

cd build && ctest --output-on-failure

Stability soak harness for long-running validation:

./scripts/soak.sh /path/to/model.gguf --iters 100 --tokens 1024

Bench harnesses for cold/warm load and llama.cpp comparison live under scripts/. See docs/testing.md.

License

MIT — see LICENSE. Permissive on purpose so genie-claw (AGPL-3.0) and other consumers can embed the engine cheaply.

Name		Name	Last commit message	Last commit date
Latest commit History 114 Commits
deploy/systemd		deploy/systemd
docs		docs
include		include
scripts		scripts
src		src
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md
ROADMAP.md		ROADMAP.md
TESTING.md		TESTING.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

genie-ai-runtime

At a glance

Why this exists

Performance (Qwen3-4B Q4_K_M, 25 W MAXN SUPER)

Quickstart

Build

CLI — single prompt

CLI — interactive chat

Server — HTTP, OpenAI-compatible

Project layout

Documentation

Integration with genie-claw

Tests

License

Related

About

Uh oh!

Releases 10

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

genie-ai-runtime

At a glance

Why this exists

Performance (Qwen3-4B Q4_K_M, 25 W MAXN SUPER)

Quickstart

Build

CLI — single prompt

CLI — interactive chat

Server — HTTP, OpenAI-compatible

Project layout

Documentation

Integration with genie-claw

Tests

License

Related

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 10

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages