Skip to content

ml-rust/blazr

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

72 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

blazr

Rust License: Apache-2.0 Build Status

Production-grade inference server for LLMs. Supports standard HuggingFace models (Llama, Mistral, Qwen, Phi, Gemma, DeepSeek) and custom hybrid architectures (Mamba2, MLA, MoE). Loads SafeTensors, AWQ, GPTQ, and GGUF formats.

Features

  • Multi-Architecture - Llama, Mistral, Mamba2, MLA+MoE, hybrid models, and more
  • Multi-Format - SafeTensors (F16/BF16), AWQ INT4, GPTQ INT4, GGUF (23 quantization levels)
  • Auto-Detection - Detects architecture, format, and tokenizer from checkpoint tensor names
  • OpenAI-Compatible API - Drop-in replacement with /v1/completions and /v1/chat/completions
  • Streaming - Server-Sent Events (SSE) for real-time token generation
  • HuggingFace Hub - Pull models directly from HuggingFace
  • Production Features - Rate limiting, request timeouts, graceful shutdown, CORS, error handling
  • CUDA Acceleration - Optional GPU inference with optimized quantization kernels

Quick Start

Installation

# Build (CPU-only)
cargo build --release

# Build with CUDA support (requires CUDA 12.x)
cargo build --release --features cuda

Generate Text

blazr run \
  --model meta-llama/Llama-3.2-1B \
  --prompt "Once upon a time" \
  --max-tokens 100

Start Server

blazr serve --model meta-llama/Llama-3.2-1B --port 8080
# Text completion
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "The capital of France is",
    "max_tokens": 50,
    "temperature": 0.7
  }'

# Chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100,
    "stream": true
  }'

Other Commands

blazr info --model ./models/mistral-7b     # Show model architecture and config
blazr list                                  # List available local models
blazr pull Qwen/Qwen2.5-0.5B               # Download from HuggingFace

Supported Models

Tested

Model SafeTensors AWQ GPTQ GGUF
Llama 3.2 1B x x x x
Mistral 7B x x x x
Mamba2 (oxidizr) x - -
DeepSeek-V2/V3 (MLA+MoE) x
Hybrid Mamba+Attn (oxidizr) x - -

Expected Compatible (route to Llama)

Qwen2/2.5, Phi-3/3.5/4, Gemma/Gemma2, StarCoder2, Yi, InternLM2, CodeLlama, Solar

Planned

Mixtral (MoE), Falcon (ALiBi), Command-R, GPT-NeoX/Pythia, DBRX

Model Formats

blazr auto-detects the format from checkpoint contents:

SafeTensors (HuggingFace)

model_dir/
├── model.safetensors          # or sharded: model-00001-of-00002.safetensors
├── config.json                # HuggingFace model config
└── tokenizer_config.json      # Chat template (optional)

SafeTensors (oxidizr)

checkpoint_dir/
├── model.safetensors
└── config.json                # oxidizr config (optional, inferred from tensors)

GGUF

model.gguf                     # Single file: weights + tokenizer + config

Supports all 23 GGUF quantization levels (Q2_K through Q8_0, IQ series, TQ series). CPU has dedicated kernels for all formats. CUDA has optimized dp4a kernels for Q4_K, Q6_K, Q8_0 with generic fallback for the rest.

API Reference

Endpoints

Method Path Description
GET /health Health check
GET /v1/models List loaded models
POST /v1/completions Text completion
POST /v1/chat/completions Chat completion

Request Parameters

{
  "prompt": "text",
  "messages": [{ "role": "user", "content": "text" }],
  "max_tokens": 100,
  "temperature": 0.7,
  "top_p": 0.9,
  "top_k": 40,
  "stop": ["\n\n"],
  "stream": false
}

Response Format

Non-streaming responses follow the OpenAI format:

{
  "id": "cmpl-...",
  "object": "text_completion",
  "created": 1234567890,
  "choices": [
    {
      "text": "generated text",
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 20,
    "total_tokens": 25
  }
}

Streaming responses use SSE with data: {...} chunks and data: [DONE] sentinel.

CLI Reference

blazr run       Generate text from a prompt
blazr serve     Start the inference server
blazr info      Display model architecture and configuration
blazr list      List available local models
blazr pull      Download a model from HuggingFace Hub

Generation Options

Flag Default Description
--model required Local path or HuggingFace model ID
--prompt required Input text
--max-tokens 100 Maximum tokens to generate
--temperature 0.7 Sampling temperature (0 = greedy)
--top-p 0.9 Nucleus sampling threshold
--top-k 40 Top-k sampling
--vocab auto Tokenizer vocabulary (auto-detected from model)
--cpu false Force CPU inference

Server Options

Flag Default Description
--model required Local path or HuggingFace model ID
--port 8080 Port to listen on
--host 0.0.0.0 Host to bind to
--cpu false Force CPU inference

Tokenizer

Uses splintr for BPE tokenization. Vocabulary is auto-detected from model config or can be specified with --vocab.

Vocabulary Models Vocab Size
llama3 Llama 3.x, Mistral ~128k
cl100k_base GPT-4, GPT-3.5-turbo ~100k
o200k_base GPT-4o ~200k
deepseek_v3 DeepSeek V3/R1 ~129k

GGUF files include an embedded tokenizer which is extracted automatically.

Architecture

blazr is a thin application layer built on boostr (ML framework) and numr (numerical computing). All model architectures, tensor operations, and quantization kernels live in boostr. blazr provides the CLI, HTTP server, model loading, and request orchestration.

blazr (CLI + HTTP server + model lifecycle)
  |
boostr (model architectures + quant kernels + NN modules)
  |
numr (tensors + linalg + multi-backend: CPU/CUDA/WebGPU)

Requirements

  • Rust 1.70+
  • (Optional) CUDA 12.x for GPU acceleration

License

Apache-2.0 - see LICENSE for details.

Related Projects

  • boostr - ML framework (model architectures, quantization)
  • numr - Foundational numerical computing (tensors, linalg, multi-backend)
  • oxidizr - Training framework for hybrid architectures
  • splintr - High-performance BPE tokenizer
  • compressr - Model conversion and compression

About

Production-grade inference server for LLMs. Supports standard HuggingFace models (Llama, Mistral, Qwen, Phi, Gemma, DeepSeek) and custom hybrid architectures (Mamba2, MLA, MoE). Loads SafeTensors, AWQ, GPTQ, and GGUF formats

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages