blazr

Production-grade inference server for LLMs. Supports standard HuggingFace models (Llama, Mistral, Qwen, Phi, Gemma, DeepSeek) and custom hybrid architectures (Mamba2, MLA, MoE). Loads SafeTensors, AWQ, GPTQ, and GGUF formats.

Features

Multi-Architecture - Llama, Mistral, Mamba2, MLA+MoE, hybrid models, and more
Multi-Format - SafeTensors (F16/BF16), AWQ INT4, GPTQ INT4, GGUF (23 quantization levels)
Auto-Detection - Detects architecture, format, and tokenizer from checkpoint tensor names
OpenAI-Compatible API - Drop-in replacement with /v1/completions and /v1/chat/completions
Streaming - Server-Sent Events (SSE) for real-time token generation
HuggingFace Hub - Pull models directly from HuggingFace
Production Features - Rate limiting, request timeouts, graceful shutdown, CORS, error handling
CUDA Acceleration - Optional GPU inference with optimized quantization kernels

Quick Start

Installation

# Build (CPU-only)
cargo build --release

# Build with CUDA support (requires CUDA 12.x)
cargo build --release --features cuda

Generate Text

blazr run \
  --model meta-llama/Llama-3.2-1B \
  --prompt "Once upon a time" \
  --max-tokens 100

Start Server

blazr serve --model meta-llama/Llama-3.2-1B --port 8080

# Text completion
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "The capital of France is",
    "max_tokens": 50,
    "temperature": 0.7
  }'

# Chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100,
    "stream": true
  }'

Other Commands

blazr info --model ./models/mistral-7b     # Show model architecture and config
blazr list                                  # List available local models
blazr pull Qwen/Qwen2.5-0.5B               # Download from HuggingFace

Supported Models

Tested

Model	SafeTensors	AWQ	GPTQ	GGUF
Llama 3.2 1B	x	x	x	x
Mistral 7B	x	x	x	x
Mamba2 (oxidizr)	x	-	-
DeepSeek-V2/V3 (MLA+MoE)	x
Hybrid Mamba+Attn (oxidizr)	x	-	-

Expected Compatible (route to Llama)

Qwen2/2.5, Phi-3/3.5/4, Gemma/Gemma2, StarCoder2, Yi, InternLM2, CodeLlama, Solar

Planned

Mixtral (MoE), Falcon (ALiBi), Command-R, GPT-NeoX/Pythia, DBRX

Model Formats

blazr auto-detects the format from checkpoint contents:

SafeTensors (HuggingFace)

model_dir/
├── model.safetensors          # or sharded: model-00001-of-00002.safetensors
├── config.json                # HuggingFace model config
└── tokenizer_config.json      # Chat template (optional)

SafeTensors (oxidizr)

checkpoint_dir/
├── model.safetensors
└── config.json                # oxidizr config (optional, inferred from tensors)
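
To see what "inferred from tensors" has to work with, the sketch below lists the tensor names stored in a SafeTensors file. It relies only on the published SafeTensors layout (an 8-byte little-endian header length followed by a JSON header). The function name `read_tensor_index` is illustrative and is not part of blazr; how blazr maps tensor names to an architecture is not shown here.

```python
import json
import struct


def read_tensor_index(path):
    """Return {tensor_name: {"dtype", "shape", "data_offsets"}} for a SafeTensors file."""
    with open(path, "rb") as f:
        # The file starts with an unsigned 64-bit little-endian header length...
        (header_len,) = struct.unpack("<Q", f.read(8))
        # ...followed by that many bytes of UTF-8 JSON describing every tensor.
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header
```

Calling `sorted(read_tensor_index("checkpoint_dir/model.safetensors"))` gives the tensor names without loading any weights, which is usually enough to eyeball which layer types a checkpoint contains.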

GGUF

model.gguf                     # Single file: weights + tokenizer + config

blazr supports all 23 GGUF quantization levels (Q2_K through Q8_0, plus the IQ and TQ series). The CPU backend has dedicated kernels for every level. The CUDA backend has optimized dp4a kernels for Q4_K, Q6_K, and Q8_0, with a generic fallback for the rest.
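
As a rough illustration of content-based format detection, the sketch below tells a GGUF file from a SafeTensors file by its first bytes. It is not blazr's detection logic, which is not documented here. It assumes only the public file layouts: GGUF files begin with the ASCII magic `GGUF`, and SafeTensors files begin with an 8-byte header length followed by a JSON object. The name `sniff_format` is hypothetical.

```python
import struct
from pathlib import Path


def sniff_format(path):
    """Guess a checkpoint file's format from its leading bytes."""
    path = Path(path)
    with path.open("rb") as f:
        head = f.read(8)
        # GGUF files start with the four ASCII bytes "GGUF".
        if head[:4] == b"GGUF":
            return "gguf"
        if len(head) == 8:
            # SafeTensors: u64 little-endian header length, then a JSON object.
            (header_len,) = struct.unpack("<Q", head)
            fits_in_file = 0 < header_len <= path.stat().st_size - 8
            if fits_in_file and f.read(1) == b"{":
                return "safetensors"
    return "unknown"
```

This check does not distinguish AWQ or GPTQ checkpoints; telling those apart would need a look at the tensor names or config, which this sketch does not attempt.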

API Reference

Endpoints

Method	Path	Description
GET	`/health`	Health check
GET	`/v1/models`	List loaded models
POST	`/v1/completions`	Text completion
POST	`/v1/chat/completions`	Chat completion
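
A script that launches `blazr serve` may want to wait until the server answers before sending requests. The sketch below polls `/health` using only the Python standard library. The README does not specify the body of the health response, so the sketch assumes that an HTTP 200 status means "ready". The function name `wait_until_healthy` is illustrative.

```python
import time
import urllib.error
import urllib.request


def wait_until_healthy(base_url, timeout=30.0, interval=0.5):
    """Poll GET /health until it returns HTTP 200 or the timeout expires."""
    url = base_url.rstrip("/") + "/health"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=max(interval, 1.0)) as resp:
                # Assumption: any 200 response means the server is ready.
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused or an error status: not ready yet, retry.
            pass
        time.sleep(interval)
    return False
```

For example, `wait_until_healthy("http://localhost:8080")` returns `True` once the server started with `--port 8080` responds, and `False` if it never does within 30 seconds.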

Request Parameters

{
  "prompt": "text",
  "messages": [{ "role": "user", "content": "text" }],
  "max_tokens": 100,
  "temperature": 0.7,
  "top_p": 0.9,
  "top_k": 40,
  "stop": ["\n\n"],
  "stream": false
}
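
The listing above combines the fields of both endpoints. In the Quick Start examples, `prompt` is sent to `/v1/completions` and `messages` to `/v1/chat/completions`. The sketch below builds both requests with the Python standard library, mirroring the curl commands. The helper `build_request` is illustrative, and it performs no network I/O on its own.

```python
import json
import urllib.request


def build_request(base_url, path, body):
    """Build a JSON POST request for a blazr endpoint (nothing is sent here)."""
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Text completion: uses "prompt".
completion = build_request(
    "http://localhost:8080",
    "/v1/completions",
    {"prompt": "The capital of France is", "max_tokens": 50, "temperature": 0.7},
)

# Chat completion: uses "messages".
chat = build_request(
    "http://localhost:8080",
    "/v1/chat/completions",
    {"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 100},
)
```

With a server running, `urllib.request.urlopen(completion)` sends the request and returns the response object.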

Response Format

Non-streaming responses follow the OpenAI format:

{
  "id": "cmpl-...",
  "object": "text_completion",
  "created": 1234567890,
  "choices": [
    {
      "text": "generated text",
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 20,
    "total_tokens": 25
  }
}
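
A client typically needs only the generated text and the token counts. The sketch below pulls those out of a `/v1/completions` response shaped like the example above. The function name `summarize_completion` is illustrative. The chat completion response shape is not shown in this README, so the sketch does not handle it.

```python
import json


def summarize_completion(raw_body):
    """Extract the generated text and token accounting from a response body."""
    response = json.loads(raw_body)
    first = response["choices"][0]
    usage = response["usage"]
    return {
        "text": first["text"],
        "finish_reason": first["finish_reason"],
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
        "total_tokens": usage["total_tokens"],
    }
```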

Streaming responses use SSE: each chunk arrives as a `data: {...}` line, and the stream ends with a `data: [DONE]` sentinel.
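
A minimal sketch of consuming such a stream is shown below. It decodes each `data:` line as JSON and stops at `[DONE]`. It makes no assumption about the fields inside each chunk, since the streaming chunk schema is not listed here. The name `iter_sse_chunks` is illustrative.

```python
import json


def iter_sse_chunks(lines):
    """Yield decoded JSON payloads from SSE lines until the [DONE] sentinel."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        # Skip blank separator lines and any non-data fields.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)
```

Any iterable of lines works as input, including the response object returned by `urllib.request.urlopen` for a request with `"stream": true`.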

CLI Reference

blazr run       Generate text from a prompt
blazr serve     Start the inference server
blazr info      Display model architecture and configuration
blazr list      List available local models
blazr pull      Download a model from HuggingFace Hub

Generation Options

Flag	Default	Description
`--model`	required	Local path or HuggingFace model ID
`--prompt`	required	Input text
`--max-tokens`	100	Maximum tokens to generate
`--temperature`	0.7	Sampling temperature (0 = greedy)
`--top-p`	0.9	Nucleus sampling threshold
`--top-k`	40	Top-k sampling
`--vocab`	auto	Tokenizer vocabulary (auto-detected from model)
`--cpu`	false	Force CPU inference
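
To make the interaction of `--temperature`, `--top-k`, and `--top-p` concrete, the sketch below applies the three controls to a list of raw logits. It is a generic textbook version, not the sampler used by blazr or boostr. In particular, the order in which the filters are applied is an assumption and varies between implementations.

```python
import math
import random


def sample_token(logits, temperature=0.7, top_k=40, top_p=0.9, rng=None):
    """Pick a token index from raw logits using temperature, top-k, and top-p."""
    rng = rng or random.Random()
    # Temperature 0 means greedy decoding: always take the highest logit.
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])

    # Rank candidates by logit and keep only the top_k best.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]

    # Softmax over the survivors, with logits scaled by the temperature.
    peak = logits[ranked[0]]
    weights = [math.exp((logits[i] - peak) / temperature) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Nucleus filter: keep the smallest prefix whose mass reaches top_p.
    kept, mass = [], 0.0
    for idx, p in zip(ranked, probs):
        kept.append((idx, p))
        mass += p
        if mass >= top_p:
            break

    # Draw from the kept candidates in proportion to their probability.
    indices = [idx for idx, _ in kept]
    return rng.choices(indices, weights=[p for _, p in kept], k=1)[0]
```

Lower temperatures sharpen the distribution toward the top candidate, while `top_k` and `top_p` cut off the unlikely tail before sampling.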

Server Options

Flag	Default	Description
`--model`	required	Local path or HuggingFace model ID
`--port`	8080	Port to listen on
`--host`	0.0.0.0	Host to bind to
`--cpu`	false	Force CPU inference

Tokenizer

blazr uses splintr for BPE tokenization. The vocabulary is auto-detected from the model config, or can be set explicitly with `--vocab`.

Vocabulary	Models	Vocab Size
`llama3`	Llama 3.x, Mistral	~128k
`cl100k_base`	GPT-4, GPT-3.5-turbo	~100k
`o200k_base`	GPT-4o	~200k
`deepseek_v3`	DeepSeek V3/R1	~129k

GGUF files include an embedded tokenizer, which is extracted automatically.

Architecture

blazr is a thin application layer built on boostr (ML framework) and numr (numerical computing). All model architectures, tensor operations, and quantization kernels live in boostr. blazr provides the CLI, HTTP server, model loading, and request orchestration.

blazr (CLI + HTTP server + model lifecycle)
  |
boostr (model architectures + quant kernels + NN modules)
  |
numr (tensors + linalg + multi-backend: CPU/CUDA/WebGPU)

Requirements

Rust 1.70+
(Optional) CUDA 12.x for GPU acceleration

License

Apache-2.0 - see LICENSE for details.

Related Projects

boostr - ML framework (model architectures, quantization)
numr - Foundational numerical computing (tensors, linalg, multi-backend)
oxidizr - Training framework for hybrid architectures
splintr - High-performance BPE tokenizer
compressr - Model conversion and compression
