A pure Rust, built-from-scratch LLM inference engine.
SlowInfer 🐌 is a from-scratch LLM inference engine written purely in Rust.
This project was built during the Spring Festival just to kill time and some CPU cycles.
- From Scratch: Tensor, operators, tokenizers, GGUF parser, and model architectures are all implemented from scratch.
- PyTorch-like Tensor API: Intuitive interface supporting `reshape`, `view`, `permute`, advanced indexing, broadcasting, and more.
- Model Support: Includes a minimal implementation of the Qwen3 architecture.
- OpenAI-style HTTP API: `/v1/chat/completions`, `/v1/completions`, and `/v1/models`.
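To give a feel for what a "PyTorch-like" tensor interface means in Rust, here is a self-contained toy sketch of row-major `reshape` and a 2-D `permute`. This is an illustration only, not SlowInfer's actual API; the `Tensor` struct and method names here are assumptions based on the operations listed above.

```rust
// Illustrative toy tensor; NOT SlowInfer's real implementation.
#[derive(Debug, Clone)]
struct Tensor {
    data: Vec<f32>,
    shape: Vec<usize>,
}

impl Tensor {
    fn new(data: Vec<f32>, shape: Vec<usize>) -> Self {
        assert_eq!(data.len(), shape.iter().product::<usize>());
        Tensor { data, shape }
    }

    /// Reshape is cheap for contiguous row-major data: only metadata changes.
    fn reshape(&self, shape: Vec<usize>) -> Tensor {
        assert_eq!(self.data.len(), shape.iter().product::<usize>());
        Tensor { data: self.data.clone(), shape }
    }

    /// Permute a 2-D tensor's axes by physically transposing (simplest form).
    fn permute2(&self) -> Tensor {
        assert_eq!(self.shape.len(), 2);
        let (r, c) = (self.shape[0], self.shape[1]);
        let mut out = vec![0.0; r * c];
        for i in 0..r {
            for j in 0..c {
                out[j * r + i] = self.data[i * c + j];
            }
        }
        Tensor { data: out, shape: vec![c, r] }
    }
}

fn main() {
    let t = Tensor::new((0..6).map(|x| x as f32).collect(), vec![2, 3]);
    let p = t.permute2();
    println!("{:?}", p.shape); // [3, 2]
    println!("{:?}", p.data);  // [0.0, 3.0, 1.0, 4.0, 2.0, 5.0]
}
```

A real engine would track strides so `permute` and `view` avoid copying, but the API surface looks much the same.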
🚧 Work in progress: APIs and model coverage are subject to change.
- Currently capable of running `Qwen3-0.6B-Q8_0.gguf`
- Execution is CPU-only for now
- Many essential features are still under development
- Bring a GGUF file (e.g. `Qwen3-0.6B-Q8_0.gguf`).
- Start the server:

```sh
cargo run --release --bin server -- --gguf Qwen3-0.6B-Q8_0.gguf --host 127.0.0.1 --port 8765
```

- Send a test request:
```sh
curl http://127.0.0.1:8765/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "slowinfer-qwen3",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

- Sit back and relax. It's called SlowInfer for a reason.
- KV-Cache
- High-performance operators
- Memory-mapped weight loading
- Broader quantization support
- More tokenizers, samplers, and model architectures
- Maybe rename the project once we hit these milestones 😈
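Of the roadmap items above, the KV-cache is the most mechanical to picture: during autoregressive decoding, each layer stores the attention keys and values of past tokens and appends to them, instead of recomputing them at every step. A minimal sketch of the data structure (toy types, not SlowInfer code; the field and method names are assumptions):

```rust
// Toy per-layer KV-cache sketch; not SlowInfer's actual implementation.
struct KvCache {
    keys: Vec<Vec<f32>>,   // one key vector per past token
    values: Vec<Vec<f32>>, // one value vector per past token
}

impl KvCache {
    fn new() -> Self {
        KvCache { keys: Vec::new(), values: Vec::new() }
    }

    /// Append the current token's K/V; older entries are reused, not recomputed.
    fn push(&mut self, k: Vec<f32>, v: Vec<f32>) {
        self.keys.push(k);
        self.values.push(v);
    }

    fn len(&self) -> usize {
        self.keys.len()
    }
}

fn main() {
    let mut cache = KvCache::new();
    // Decoding three tokens: each step attends over all cached entries.
    for t in 0..3 {
        cache.push(vec![t as f32; 4], vec![t as f32; 4]);
        // attention at step t would score the new query against cache.keys[0..=t]
    }
    println!("cached tokens: {}", cache.len()); // cached tokens: 3
}
```

This turns per-token attention cost from quadratic recomputation into a single append plus a scan over the cache, which is why it tops the performance roadmap.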
SlowInfer is licensed under MIT. See LICENSE for details.
