ML framework built on numr — attention, quantization, model architectures.
boostr extends numr with production-grade ML primitives. It provides attention mechanisms, quantization support, model architectures, and inference infrastructure — all built on numr's foundational tensors, runtimes, and ops. No reimplementation. No wrappers. Pure extension traits.
- 26 formats (GGUF-compatible): Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1, Q2_K–Q8_K, IQ1_S–IQ4_XS, TQ1_0, TQ2_0
- QuantTensor type for block-quantized data
- Per-backend kernels: Native SIMD (CPU), PTX (CUDA), WGSL (WebGPU)
- Zero-copy GGUF loading with memory mapping
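To make the block-quantization idea concrete, here is a self-contained sketch of the Q8_0 scheme: 32 values per block, one shared scale per block. GGUF stores the scale as f16; this sketch uses f32 for brevity, and none of these names are boostr's actual API.

```rust
// Q8_0-style block: 32 signed 8-bit values sharing one scale factor.
// Illustrative only — boostr's QuantTensor and kernels are more involved.
const BLOCK_SIZE: usize = 32;

struct BlockQ8 {
    scale: f32,          // f16 in the real GGUF layout
    qs: [i8; BLOCK_SIZE],
}

fn quantize_q8(values: &[f32; BLOCK_SIZE]) -> BlockQ8 {
    // Pick the scale so the largest magnitude maps to +/-127.
    let amax = values.iter().fold(0f32, |m, v| m.max(v.abs()));
    let scale = if amax == 0.0 { 1.0 } else { amax / 127.0 };
    let mut qs = [0i8; BLOCK_SIZE];
    for (q, v) in qs.iter_mut().zip(values) {
        *q = (v / scale).round().clamp(-127.0, 127.0) as i8;
    }
    BlockQ8 { scale, qs }
}

fn dequantize_q8(block: &BlockQ8) -> [f32; BLOCK_SIZE] {
    let mut out = [0f32; BLOCK_SIZE];
    for (o, q) in out.iter_mut().zip(&block.qs) {
        *o = block.scale * *q as f32;
    }
    out
}

fn main() {
    let block = quantize_q8(&[0.5f32; BLOCK_SIZE]);
    let restored = dequantize_q8(&block);
    // Round-trip error is bounded by half a quantization step.
    assert!((restored[0] - 0.5).abs() < 1e-3);
}
```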
- Flash Attention v2/v3 with fused QKV projection
- Multi-Head Latent Attention (MLA) — compressed KV cache
- Grouped Query Attention (GQA) and multi-head variants
- Paged attention for memory-efficient inference
- Variable-length attention with ragged tensors
- Prefix caching for context reuse
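All of these variants accelerate the same underlying computation: scaled dot-product attention. A naive single-query reference (illustrative only, not boostr's API) makes the baseline explicit:

```rust
// Naive single-query scaled dot-product attention:
// scores_j = q . k_j / sqrt(d), softmax over positions, weighted sum of values.
// Flash/paged variants compute exactly this, tiled for memory efficiency.
fn attend(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let scale = (q.len() as f32).sqrt();
    // Scaled dot products against every key.
    let mut scores: Vec<f32> = keys
        .iter()
        .map(|k| q.iter().zip(k).map(|(a, b)| a * b).sum::<f32>() / scale)
        .collect();
    // Numerically stable softmax.
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let sum: f32 = scores.iter_mut().map(|s| { *s = (*s - max).exp(); *s }).sum();
    // Probability-weighted sum of value rows.
    let mut out = vec![0.0f32; values[0].len()];
    for (s, v) in scores.iter().zip(values) {
        for (o, x) in out.iter_mut().zip(v) {
            *o += (s / sum) * x;
        }
    }
    out
}

fn main() {
    let q = [1.0f32, 0.0];
    let keys = vec![vec![10.0, 0.0], vec![-10.0, 0.0]];
    let values = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let out = attend(&q, &keys, &values);
    // Key 0 dominates the softmax, so the output is close to values[0].
    assert!(out[0] > 0.99 && out[1] < 0.01);
}
```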
- RoPE: Split-half, interleaved, ALiBi variants
- YaRN for length extrapolation
- Efficient fused implementations on all backends
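As a concrete picture of the split-half variant: each feature pair (i, i + d/2) is rotated by a position-dependent angle θᵢ = pos · base^(−2i/d). A scalar reference sketch (not boostr's fused kernel):

```rust
// Split-half RoPE sketch: rotate feature pairs (i, i + d/2) by
// theta_i = pos * base^(-2i/d). Scalar reference, illustrative only.
fn rope_split_half(x: &mut [f32], pos: usize, base: f32) {
    let d = x.len();
    let half = d / 2;
    for i in 0..half {
        let theta = pos as f32 * base.powf(-2.0 * i as f32 / d as f32);
        let (sin, cos) = theta.sin_cos();
        let (x1, x2) = (x[i], x[i + half]);
        x[i] = x1 * cos - x2 * sin;
        x[i + half] = x1 * sin + x2 * cos;
    }
}

fn main() {
    let mut v = [1.0f32, 0.0, 0.0, 1.0];
    rope_split_half(&mut v, 1, 10_000.0);
    // A rotation preserves the norm of each pair, hence of the whole vector.
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    assert!((norm - 2.0f32.sqrt()).abs() < 1e-5);
}
```

Because the rotation angle depends only on position and feature index, relative offsets between query and key positions fall out of the dot product, which is what makes RoPE attractive for attention.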
- LLaMA — standard and tensor-parallelized
- Mamba2 — state space models with SSD kernels
- Hybrid — mixed transformer/SSM models
- Extensible architecture system for custom models
- Linear — standard and quantized variants
- Embedding for token embeddings
- LayerNorm, RMSNorm with fused implementations
- MoE layers with expert routing and load balancing
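For reference, the computation behind the fused RMSNorm kernels is yᵢ = wᵢ · xᵢ / sqrt(mean(x²) + ε). A naive scalar sketch (not boostr's implementation):

```rust
// RMSNorm reference: y_i = w_i * x_i / sqrt(mean(x^2) + eps).
// A naive loop; fused kernels compute the same thing in one pass.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}

fn main() {
    let y = rms_norm(&[3.0, 4.0], &[1.0, 1.0], 1e-6);
    // mean(x^2) = 12.5, so y[0] = 3 / sqrt(12.5) ~= 0.8485
    assert!((y[0] - 3.0 / 12.5f32.sqrt()).abs() < 1e-5);
}
```

Unlike LayerNorm, RMSNorm skips the mean-centering step, which is why it is cheap to fuse.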
- Paged KV cache with block allocator for memory efficiency
- Request scheduler with continuous batching
- Prefix caching for prompt reuse
- Speculative decoding with adaptive draft depth and verification kernels
- Flash decoding for single-token decode (CUDA, auto-selected when S_q=1)
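The paged KV cache rests on a simple idea: cache memory is carved into fixed-size blocks handed out from a free list, so sequences of different lengths share one pool without fragmentation. A toy allocator in that spirit (not boostr's actual `PagedKvCache` internals):

```rust
// Toy fixed-size block allocator, in the spirit of a paged KV cache:
// sequences claim blocks from a free list and return them on eviction.
// Illustrative only — names and layout are not boostr's API.
struct BlockAllocator {
    free: Vec<usize>, // indices of free blocks, used as a stack
}

impl BlockAllocator {
    fn new(num_blocks: usize) -> Self {
        Self { free: (0..num_blocks).rev().collect() }
    }
    fn alloc(&mut self) -> Option<usize> {
        self.free.pop()
    }
    fn release(&mut self, block: usize) {
        self.free.push(block);
    }
}

fn main() {
    let mut alloc = BlockAllocator::new(2);
    assert_eq!(alloc.alloc(), Some(0));
    assert_eq!(alloc.alloc(), Some(1));
    assert_eq!(alloc.alloc(), None); // pool exhausted
    alloc.release(0);
    assert_eq!(alloc.alloc(), Some(0)); // freed block is reused
}
```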
- Optimizers: AdamW, Lamb, SGD with gradient clipping
- Mixed precision (AMP) with automatic loss scaling
- Gradient accumulation and checkpointing
- Learning rate scheduling (warmup, cosine, linear decay)
- Distributed training:
- ZeRO stage 1/2/3 (parameter/gradient/optimizer sharding)
- Tensor parallelism with communicators
- Pipeline parallelism (1F1B, Gpipe, ZeroBubble schedules)
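The warmup-plus-cosine schedule listed above can be sketched in a few lines. This is a generic reference for the shape of the curve; the function and parameter names are illustrative, not boostr's scheduler API:

```rust
use std::f32::consts::PI;

// Warmup-then-cosine schedule sketch: linear warmup for `warmup` steps,
// then cosine decay from `peak` toward zero over the remaining steps.
fn lr_at(step: usize, total: usize, warmup: usize, peak: f32) -> f32 {
    if step < warmup {
        // Linear ramp: step 0 gets peak/warmup, step warmup-1 gets peak.
        peak * (step + 1) as f32 / warmup as f32
    } else {
        // Cosine decay over the post-warmup fraction t in [0, 1).
        let t = (step - warmup) as f32 / (total - warmup) as f32;
        0.5 * peak * (1.0 + (PI * t).cos())
    }
}

fn main() {
    let peak = 3e-4;
    assert!((lr_at(99, 1000, 100, peak) - peak).abs() < 1e-9); // end of warmup
    assert!(lr_at(999, 1000, 100, peak) < 1e-5);               // decayed near zero
}
```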
- SafeTensors: Zero-copy memory-mapped loading
- GGUF: Full format support with block-quantized tensors
- Format auto-detection
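Format auto-detection typically reduces to sniffing the file header: GGUF files open with the magic bytes `GGUF`, while SafeTensors files begin with a little-endian u64 header length followed by a JSON header. A minimal sketch of that idea (boostr's actual detection logic may differ):

```rust
// Format sniffing sketch based on the published GGUF and SafeTensors layouts:
// GGUF starts with the magic bytes "GGUF"; SafeTensors starts with an 8-byte
// little-endian header length followed by a JSON header (first byte is '{').
fn detect_format(bytes: &[u8]) -> &'static str {
    if bytes.len() >= 4 && &bytes[..4] == b"GGUF" {
        "gguf"
    } else if bytes.len() >= 9 && bytes[8] == b'{' {
        "safetensors"
    } else {
        "unknown"
    }
}

fn main() {
    assert_eq!(detect_format(b"GGUF\x03\x00\x00\x00"), "gguf");
    // Fake SafeTensors prefix: header length 2, then the JSON header "{}".
    let mut st = 2u64.to_le_bytes().to_vec();
    st.extend_from_slice(b"{}");
    assert_eq!(detect_format(&st), "safetensors");
    assert_eq!(detect_format(b"??"), "unknown");
}
```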
- CPU: SIMD kernels (AVX2, NEON), native ops
- CUDA: PTX kernels, Flash Attention v2/v3, fused ops (CUDA 12.x)
- WebGPU: WGSL shaders, cross-platform GPU support
```
┌───────────────────────────────────────────────────────┐
│                        boostr                         │
│  (attention, RoPE, MoE, quantization, model loaders)  │
└──────────────────────────┬────────────────────────────┘
                           │
                        (uses)
                           │
┌──────────────────────────▼────────────────────────────┐
│                         numr                          │
│     (tensors, ops, runtime, autograd, linalg, FFT)    │
└───────────────────────────────────────────────────────┘
```
Design principles:
- Extension traits: ML ops (AttentionOps, RoPEOps) implemented on numr's clients — not new types
- QuantTensor: Separate type for quantized data with custom kernels
- impl_generic: Composite ops composed from numr primitives, same logic on all backends
- Custom kernels: Dequant, quantized matmul, fused attention use per-backend optimizations (SIMD/PTX/WGSL)
- Vendor-agnostic: No cuBLAS, cuDNN, or MKL; all native kernels
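The extension-trait principle can be shown in miniature: new ML ops are traits implemented for an existing client type, composed from the primitives that type already exposes, with no wrapper type in between. Everything below is a stand-in (`CpuClient` here is not numr's real type, and `SoftmaxOps` is not a boostr trait):

```rust
// The extension-trait pattern in miniature. The "base crate" provides a
// client with primitive ops; a downstream crate extends it via a trait.
struct CpuClient;

impl CpuClient {
    // A primitive op the base crate already provides.
    fn scale(&self, x: &[f32], s: f32) -> Vec<f32> {
        x.iter().map(|v| v * s).collect()
    }
}

// Downstream crate: a composite op added to the existing client type.
trait SoftmaxOps {
    fn softmax(&self, x: &[f32]) -> Vec<f32>;
}

impl SoftmaxOps for CpuClient {
    fn softmax(&self, x: &[f32]) -> Vec<f32> {
        // Numerically stable softmax composed from primitives.
        let max = x.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let exps: Vec<f32> = x.iter().map(|v| (v - max).exp()).collect();
        let sum: f32 = exps.iter().sum();
        self.scale(&exps, 1.0 / sum)
    }
}

fn main() {
    let client = CpuClient;
    let p = client.softmax(&[1.0, 1.0]);
    assert!((p[0] - 0.5).abs() < 1e-6);
}
```

Because the trait is implemented on the client itself, callers get the new ops with a single `use`, and the same composite logic runs on every backend that implements the primitives.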
Add to `Cargo.toml`:

```toml
[dependencies]
boostr = "<latest-version>"

# With CUDA support (requires CUDA 12.x)
# boostr = { version = "0.1", features = ["cuda"] }

# With WebGPU support
# boostr = { version = "0.1", features = ["wgpu"] }
```

Build from source:

```bash
# CPU build
cargo build --release

# CUDA support (requires CUDA 12.x)
cargo build --release --features cuda

# WebGPU support
cargo build --release --features wgpu

# Run tests
cargo test
cargo test --features cuda
```

Flash Attention on the CPU backend:

```rust
use boostr::*;
use boostr::ops::traits::attention::flash::FlashAttentionOps;
use numr::ops::RandomOps;
use numr::runtime::cpu::{CpuClient, CpuDevice};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = CpuClient::new(CpuDevice::new());

    // Create random tensors via numr's RandomOps
    let q = client.randn(&[1, 8, 32, 64], DType::F32)?; // [batch, heads, seq, dim]
    let k = client.randn(&[1, 8, 32, 64], DType::F32)?;
    let v = client.randn(&[1, 8, 32, 64], DType::F32)?;

    // Flash Attention forward pass
    let (output, _lse) = client.flash_attention_fwd(
        &q, &k, &v,
        8,    // num_heads
        8,    // num_kv_heads
        64,   // head_dim
        true, // causal
        0,    // window_size (0 = no sliding window)
        None, // kv_seq_len override
    )?;

    Ok(())
}
```

Loading a GGUF model:

```rust
use boostr::format::Gguf;
use boostr::{CpuRuntime, DType};
use numr::runtime::Runtime;

// Open a GGUF model file (with optional memory mapping)
let mut gguf = Gguf::open("model.gguf")?;
let metadata = gguf.metadata();
let device = <CpuRuntime as Runtime>::Device::default();

// Load tensors — quantized as QuantTensor, others as f32
for name in gguf.tensor_names().map(|s| s.to_string()).collect::<Vec<_>>() {
    let info = gguf.tensor_info(&name)?;
    if info.ggml_type.is_quantized() {
        let qt = gguf.load_tensor_quantized::<CpuRuntime>(&name, &device)?;
    } else {
        let t = gguf.load_tensor_f32::<CpuRuntime>(&name, &device)?;
    }
}
```

Paged KV cache for inference:

```rust
use boostr::inference::PagedKvCache;

// Create a paged KV cache for efficient inference
let mut kv_cache = PagedKvCache::new(
    &client,
    num_layers,
    batch_size,
    max_seq_len,
    head_dim,
)?;

// Process tokens with cache
for token_idx in 0..seq_len {
    // ... forward pass using kv_cache ...
    kv_cache.update(layer_idx, &k, &v)?;
}
```

| Feature | Purpose | Dependencies |
|---|---|---|
| `cpu` | CPU backend (default) | numr |
| `cuda` | CUDA GPU acceleration | numr/cuda, cudarc |
| `nccl` | Multi-GPU via NCCL | numr/nccl |
| `wgpu` | WebGPU cross-platform GPU | numr/wgpu |
| `f16` | Half-precision float support | numr/f16 |
| `fp8` | FP8 precision support | numr/fp8 |
- `ops/` — ML-specific operations (attention, RoPE, MoE, etc.)
- `quant/` — Quantized tensors and kernels (26 formats)
- `nn/` — Neural network modules (Linear, Embedding, LayerNorm, RMSNorm, MoE)
- `model/` — Model architectures (LLaMA, Mamba2, Hybrid)
- `format/` — Model loaders (SafeTensors, GGUF)
- `inference/` — Inference infrastructure (KV cache, scheduling, batching)
- `optimizer/` — Training optimizers (AdamW, Lamb, SGD)
- `trainer/` — Training utilities and distributed training (ZeRO, tensor/pipeline parallelism)
- `distributed/` — Multi-GPU coordination
boostr provides production-grade performance through:
- Fused kernels — Attention, layer norm, optimizer steps compiled to single kernels
- Custom quantization — Per-format SIMD/PTX/WGSL kernels for dequant and quantized matmul
- Memory efficiency — Paged KV cache, prefix caching, gradient checkpointing
- Distributed training — ZeRO stages, tensor/pipeline parallelism with minimal communication overhead
- Zero-copy loading — Memory-mapped GGUF with quantized weights
boostr is part of the ml-rust organization:
- numr — Foundational numerical computing (tensors, ops, linalg, FFT)
- boostr — ML framework (this project)
- oxidizr — Training framework for Mamba2, MLA, MoE (uses boostr)
- blazr — Inference server with OpenAI-compatible API (uses boostr)
- compressr — Model quantization and compression (uses boostr)
- splintr — High-performance BPE tokenizer
- Rust 1.85+
- For CUDA: the CUDA 12.x toolkit and the cudarc crate
- For WebGPU: wgpu and platform GPU drivers
```bash
git clone https://github.com/ml-rust/boostr.git
cd boostr

# CPU
cargo build --release

# CUDA
cargo build --release --features cuda

# Run tests
cargo test --all-features

# Format and lint
cargo fmt --all
cargo clippy --all-targets
```

- API Documentation — Full reference for the public API
- numr Documentation — Tensor and runtime types
```bash
# Run all tests
cargo test --all-features

# Specific test suite
cargo test ops::attention --all-features

# Verbose output
cargo test --all-features -- --nocapture
```

Contributions are welcome! Please see the main repository's contribution guidelines.
Licensed under the Apache License, Version 2.0. See LICENSE for details.
boostr builds on the numerical foundation provided by numr and is designed to power production ML infrastructure across training (oxidizr) and inference (blazr).