caozhanhao/slowinfer

A pure Rust, built-from-scratch LLM inference engine.

Intro

SlowInfer 🐌 is a from-scratch LLM inference engine written purely in Rust.

This project was built during the Spring Festival just to kill time and some CPU cycles.

Features

  • From Scratch: Tensor, operators, tokenizers, GGUF parser, and model architectures are all implemented from scratch.
  • PyTorch-like Tensor API: Intuitive interface supporting reshape, view, permute, advanced indexing, broadcasting, and more.
  • Model Support: Includes a minimal implementation of the Qwen3 architecture.
  • OpenAI-style HTTP API: /v1/chat/completions, /v1/completions, and /v1/models.
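To illustrate the broadcasting semantics a PyTorch-like tensor API follows, here is a minimal, self-contained Rust sketch of the standard shape-broadcasting rule (align shapes from the trailing dimension; a dimension of 1 stretches to match). This is not SlowInfer's actual API; the function name and signature are illustrative only.

```rust
// Compute the broadcast result of two shapes under NumPy/PyTorch rules.
// Returns None when the shapes are incompatible.
fn broadcast_shapes(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let n = a.len().max(b.len());
    let mut out = Vec::with_capacity(n);
    for i in 0..n {
        // Walk both shapes from the trailing dimension; missing dims count as 1.
        let da = if i < a.len() { a[a.len() - 1 - i] } else { 1 };
        let db = if i < b.len() { b[b.len() - 1 - i] } else { 1 };
        if da == db || da == 1 || db == 1 {
            out.push(da.max(db));
        } else {
            return None; // e.g. [2, 3] vs [4]
        }
    }
    out.reverse();
    Some(out)
}

fn main() {
    // [4, 1] broadcast with [3] -> [4, 3]
    assert_eq!(broadcast_shapes(&[4, 1], &[3]), Some(vec![4, 3]));
    // [2, 3] and [4] are incompatible
    assert_eq!(broadcast_shapes(&[2, 3], &[4]), None);
    println!("broadcasting demo ok");
}
```

This is the same rule NumPy documents; an element-wise op on two tensors would first compute this result shape, then index each input modulo its own (possibly size-1) dimensions.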

Status

🚧 Work in progress: APIs and model coverage are subject to change.

  • Currently capable of running Qwen3-0.6B-Q8_0.gguf
  • Execution is CPU-only for now
  • Many essential features are still under development

Quickstart

  1. Bring a GGUF file (e.g. Qwen3-0.6B-Q8_0.gguf).
  2. Start the server:
cargo run --release --bin server -- --gguf Qwen3-0.6B-Q8_0.gguf --host 127.0.0.1 --port 8765
  3. Send a test request:
curl http://127.0.0.1:8765/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "slowinfer-qwen3",
      "messages": [{"role": "user", "content": "Hello!"}]
    }'
  4. Sit back and relax. It's called SlowInfer for a reason.
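Before handing a file to the server, it can be worth sanity-checking that it is actually GGUF. A GGUF file starts with the 4-byte magic "GGUF" followed by a little-endian u32 format version. The sketch below is a standalone illustration of that header check, not SlowInfer's own parser; the function name is hypothetical.

```rust
// Read the 8-byte GGUF header prefix: magic "GGUF" + little-endian u32 version.
// Returns the version on success, None if the bytes are not a GGUF header.
fn gguf_version(header: &[u8]) -> Option<u32> {
    if header.len() < 8 || &header[0..4] != b"GGUF" {
        return None; // wrong magic or truncated header
    }
    Some(u32::from_le_bytes([header[4], header[5], header[6], header[7]]))
}

fn main() {
    // First 8 bytes of a GGUF v3 file.
    let good = [b'G', b'G', b'U', b'F', 3, 0, 0, 0];
    assert_eq!(gguf_version(&good), Some(3));
    assert_eq!(gguf_version(b"notgguf!"), None);
    println!("gguf header demo ok");
}
```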

Roadmap

  • KV-Cache
  • High-performance operators
  • Memory-mapped weight loading
  • Broader quantization support
  • More tokenizers, samplers, and model architectures
  • Maybe rename the project once we hit these milestones 😈

License

SlowInfer is licensed under MIT.
See LICENSE for details.
