gguf-runner is a pure Rust, CPU-first inference runtime for GGUF language models.
The project focuses on:
- straightforward local inference
- readable code structure
- support for multiple model families in one binary
- Build:

      cargo build --release

- Run with a local GGUF file:

      cargo run --release -- \
        --model ./Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
        --prompt "Explain what this project does."

- Show all options:

      cargo run -- --help

Required flags:

- `--model <path>`
- `--prompt <text>`

Common optional flags (a sampling sketch follows this list showing how the generation parameters typically interact):

- `--system-prompt <text>`
- `--temperature <float>`
- `--top-k <int>`
- `--top-p <float>`
- `--repeat-penalty <float>`
- `--repeat-last-n <int>`
- `--max-tokens <int>`
- `--context-size <int>`
- `--threads <int>`
- `--show-tokens`
- `--show-timings`
- `--profiling`
- `--debug`
- `--url <model-url>` (lazy bootstrap/download path for a missing or invalid local file)
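The sketch below is an illustrative, self-contained Rust example of how these generation flags commonly interact in a sampling pipeline (repetition penalty, then temperature, then top-k and top-p filtering). It is not gguf-runner's actual code: `SamplerConfig`, `sample_next`, and the toy xorshift draw are hypothetical names and stand-ins used only to show the order of operations.

```rust
/// Hypothetical sampling settings mirroring the CLI flags above.
struct SamplerConfig {
    temperature: f32,     // --temperature
    top_k: usize,         // --top-k
    top_p: f32,           // --top-p
    repeat_penalty: f32,  // --repeat-penalty
    repeat_last_n: usize, // --repeat-last-n
}

/// Pick the next token id from raw logits using the common
/// penalty -> temperature -> top-k -> top-p pipeline.
fn sample_next(logits: &[f32], recent: &[usize], cfg: &SamplerConfig, seed: &mut u64) -> usize {
    let mut logits = logits.to_vec();

    // 1. Repetition penalty: dampen tokens seen in the last `repeat_last_n` positions.
    let start = recent.len().saturating_sub(cfg.repeat_last_n);
    for &tok in &recent[start..] {
        let l = &mut logits[tok];
        *l = if *l > 0.0 { *l / cfg.repeat_penalty } else { *l * cfg.repeat_penalty };
    }

    // 2. Temperature: values below 1.0 sharpen the distribution, above 1.0 flatten it.
    for l in logits.iter_mut() {
        *l /= cfg.temperature.max(1e-5);
    }

    // 3. Softmax over the adjusted logits.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let mut probs: Vec<(usize, f32)> = logits
        .iter()
        .enumerate()
        .map(|(i, &l)| (i, (l - max).exp()))
        .collect();
    let sum: f32 = probs.iter().map(|&(_, p)| p).sum();
    for (_, p) in probs.iter_mut() {
        *p /= sum;
    }

    // 4. Top-k: keep only the k most probable candidates.
    probs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    probs.truncate(cfg.top_k.max(1));

    // 5. Top-p (nucleus): keep the smallest prefix whose cumulative mass reaches top_p.
    let mut acc = 0.0;
    let mut cut = probs.len();
    for (i, &(_, p)) in probs.iter().enumerate() {
        acc += p;
        if acc >= cfg.top_p {
            cut = i + 1;
            break;
        }
    }
    probs.truncate(cut);

    // 6. Draw from the surviving candidates (toy xorshift step; a real
    //    runtime would use a proper random source).
    *seed ^= *seed << 13;
    *seed ^= *seed >> 7;
    *seed ^= *seed << 17;
    let total: f32 = probs.iter().map(|&(_, p)| p).sum();
    let mut r = (*seed as f32 / u64::MAX as f32) * total;
    for &(tok, p) in &probs {
        if r <= p {
            return tok;
        }
        r -= p;
    }
    probs.last().map(|&(tok, _)| tok).unwrap_or(0)
}

fn main() {
    let cfg = SamplerConfig {
        temperature: 0.8,
        top_k: 40,
        top_p: 0.95,
        repeat_penalty: 1.1,
        repeat_last_n: 64,
    };
    // Toy logits over a 5-token vocabulary; tokens 1 and 3 were generated recently.
    let logits = [0.1, 2.5, 0.3, 1.8, -0.4];
    let recent = [1usize, 3];
    let mut seed = 0x1234_5678_9abc_def0_u64;
    println!("next token id: {}", sample_next(&logits, &recent, &cfg, &mut seed));
}
```

In broad terms, lower `--temperature` and smaller `--top-k`/`--top-p` make output more deterministic, and the window over which `--repeat-penalty` applies is bounded by `--repeat-last-n`.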
This runtime currently supports multiple model families (Llama, Gemma, Qwen variants), common GGUF quantization types, and platform-specific CPU optimizations.
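For a rough sense of the container format the runtime consumes, here is a minimal, standard-library-only sketch that reads the fixed GGUF header fields (the `GGUF` magic, version, tensor count, and metadata key/value count). It is a standalone illustration, not gguf-runner's parser, and the fallback `model.gguf` path is a placeholder.

```rust
use std::convert::TryInto;
use std::env;
use std::fs::File;
use std::io::{Error, ErrorKind, Read};

/// Read the fixed-size GGUF header: 4-byte magic ("GGUF"), u32 version,
/// u64 tensor count, and u64 metadata key/value count (all little-endian).
fn read_gguf_header(path: &str) -> std::io::Result<(u32, u64, u64)> {
    let mut file = File::open(path)?;
    let mut header = [0u8; 24];
    file.read_exact(&mut header)?;

    if &header[0..4] != b"GGUF" {
        return Err(Error::new(ErrorKind::InvalidData, "not a GGUF file (bad magic)"));
    }

    let version = u32::from_le_bytes(header[4..8].try_into().unwrap());
    let tensor_count = u64::from_le_bytes(header[8..16].try_into().unwrap());
    let kv_count = u64::from_le_bytes(header[16..24].try_into().unwrap());
    Ok((version, tensor_count, kv_count))
}

fn main() -> std::io::Result<()> {
    // The fallback path is a placeholder; pass any local .gguf file as the first argument.
    let path = env::args().nth(1).unwrap_or_else(|| "model.gguf".to_string());
    let (version, tensors, kvs) = read_gguf_header(&path)?;
    println!("GGUF v{version}: {tensors} tensors, {kvs} metadata entries");
    Ok(())
}
```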
- Detailed feature coverage and platform notes: docs/features.md
- Historical benchmark snapshots and performance notes: docs/performance.md
- Current module/layout reference: docs/module-structure.md
Current scope and design constraints:

- CPU inference only
- GGUF model files only
- a focus on transparent implementation over broad framework abstraction
Issues and pull requests are welcome.
Before opening a PR, run:
cargo fmt --all --check
cargo clippy --all-targets --all-features
cargo check