numr

Foundational numerical computing for Rust

numr provides n-dimensional tensors, linear algebra, FFT, statistics, and automatic differentiation, with native kernels across CPU, CUDA, and WebGPU backends.

numr is like NumPy for Rust, but with gradients, GPUs, and modern dtypes built in from day one.

What numr Is

A foundation library - Mathematical building blocks for higher-level libraries and applications.

| numr IS | numr is NOT |
| --- | --- |
| Tensor library (like NumPy's ndarray) | A deep learning framework |
| Linear algebra (decompositions, solvers) | A high-level ML API |
| FFT, statistics, random distributions | Domain-specific algorithms (see solvr) |
| Native GPU (CUDA + WebGPU) + autograd | |

For SciPy-equivalent functionality (optimization, ODE, interpolation, signal), see solvr.

Why numr?

vs NumPy

| Capability | NumPy | numr |
| --- | --- | --- |
| N-dimensional tensors | ✓ | ✓ |
| Linear algebra, FFT, stats | ✓ | ✓ |
| Automatic differentiation | ✗ Need JAX/PyTorch | ✓ Built-in numr::autograd |
| GPU acceleration | ✗ Need CuPy/JAX | ✓ Native CUDA + WebGPU |
| Non-NVIDIA GPUs | ✗ None | ✓ AMD, Intel, Apple via WebGPU |
| FP8 / BF16 dtypes | ✗ / Partial | ✓ Full support |
| Sparse tensors | ✗ SciPy separate, 2D only | ✓ Integrated, N-dimensional |
| Same code CPU↔GPU | ✗ Different libraries | ✓ Tensor<R> abstraction |

vs Rust Ecosystem

The Rust ecosystem is fragmented into libraries that don't interoperate and lack GPU support. numr consolidates everything:

| Task | Old Ecosystem | numr |
| --- | --- | --- |
| Tensors | ndarray | Tensor |
| Linear algebra | nalgebra / faer | numr::linalg |
| FFT | rustfft | numr::fft |
| Sparse | sprs / ndsparse | numr::sparse (feature-gated) |
| Statistics | statrs | numr::statistics |
| Random numbers | rand + manual distributions | numr::random + multivariate |
| GPU support | None | CPU, CUDA, WebGPU |
| Automatic differentiation | None | numr::autograd |

A Rust developer should never need to look elsewhere for numerical computing.

Architecture

numr is designed with a simple principle: same code, any backend.

┌──────────────────────────────────────────────────────────────┐
│                    Your Application                          │
│               (any backend-agnostic code)                    │
└──────────────────────────────────────────────────────────────┘
                             │
        ┌────────────────────┼────────────────────┐
        │                    │                    │
   ┌────▼────┐          ┌────▼────┐          ┌───▼────┐
   │ CPU     │          │ CUDA    │          │ WebGPU │
   │ Runtime │          │ Runtime │          │Runtime │
   └────┬────┘          └────┬────┘          └───┬────┘
        │                    │                   │
   ┌────▼──────────┬─────────┴───────┬───────────▼───┐
   │     Trait     │                 │               │
   │  Implemen-    │  Same Algorithm │  Different    │
   │  tations      │  Different Code │  Hardware     │
   └───────────────┴─────────────────┴───────────────┘

Operations

numr implements a comprehensive set of tensor operations across CPU, CUDA, and WebGPU:

Core Arithmetic

  • UnaryOps: neg, abs, sqrt, exp, log, sin, cos, tan, sinh, cosh, tanh, floor, ceil, round, and more
  • BinaryOps: add, sub, mul, div, pow, maximum, minimum (all with NumPy-style broadcasting; see the sketch after this list)
  • ScalarOps: tensor-scalar arithmetic
  • TypeConversionOps: cast (convert between dtypes)
  • UtilityOps: clamp, fill, arange, linspace, eye
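
A minimal broadcasting sketch (assuming the from_slice constructor and the client.add call shown in the Quick Start section below; shapes follow NumPy broadcasting rules):

use numr::prelude::*;
use numr::runtime::cpu::CpuRuntime;

fn broadcast_demo() -> Result<()> {
    let client = CpuRuntime::client()?;
    // A [1, 3] row broadcast against a [3, 1] column yields a [3, 3] grid
    let row = Tensor::<CpuRuntime>::from_slice(&[1.0, 2.0, 3.0], &[1, 3])?;
    let col = Tensor::<CpuRuntime>::from_slice(&[10.0, 20.0, 30.0], &[3, 1])?;
    let _grid = client.add(&row, &col)?; // shape [3, 3]
    Ok(())
}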

Shape and Data Movement

  • ShapeOps: cat, stack, split, chunk, repeat, pad, roll
  • IndexingOps: gather, scatter, gather_nd, scatter_reduce, index_select, masked_select, masked_fill, embedding_lookup, bincount, argmax, argmin, slice_assign
  • SortingOps: sort, argsort, topk, unique, nonzero, searchsorted

Reductions

  • ReduceOps: sum, mean, max, min, prod (with precision variants)
  • CumulativeOps: cumsum, cumprod, logsumexp

Comparisons and Logical

  • CompareOps: eq, ne, lt, le, gt, ge
  • LogicalOps: logical_and, logical_or, logical_xor, logical_not
  • ConditionalOps: where (ternary conditional)

Activation & Normalization Functions

  • ActivationOps: relu, sigmoid, silu, gelu, swiglu, leaky_relu, elu, softmax, dropout, fused activation-mul (for gated architectures)
  • NormalizationOps: rms_norm, layer_norm, batch_norm, group_norm, instance_norm, fused add-norm (residual + normalize in one pass)
  • GemmEpilogueOps: fused matmul+bias+activation in a single kernel (forward + backward)
  • FusedElementwiseOps: fused element-wise operation chains across all backends
  • ConvOps: conv1d, conv2d, depthwise_conv2d (with stride, padding, dilation, groups)
  • EinsumOps: Einstein summation notation

These are mathematical functions commonly used in ML, but numr itself is not an ML framework.
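
As an illustration, an activation followed by a row-wise softmax could look like the sketch below (hedged: it assumes ActivationOps follow the same client.op(&tensor) convention as the other examples in this README and that softmax takes an axis argument; exact signatures may differ):

use numr::prelude::*;
use numr::runtime::cpu::CpuRuntime;

fn activation_demo() -> Result<()> {
    let client = CpuRuntime::client()?;
    let logits = Tensor::<CpuRuntime>::from_slice(&[1.0, -2.0, 0.5, 3.0], &[2, 2])?;
    let hidden = client.relu(&logits)?;        // element-wise ReLU
    let _probs = client.softmax(&hidden, -1)?; // softmax over the last axis (assumed axis parameter)
    Ok(())
}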

Linear Algebra

  • MatmulOps: matmul, matmul_bias (fused GEMM+bias), i8×i8→i32 quantized matmul, FP8 matmul
  • LinalgOps: solve, lstsq, pinverse, inverse, det, trace, matrix_rank, diag, matrix_norm, kron, khatri_rao
  • ComplexOps: conj, real, imag, angle (for complex tensor support)

Automatic Differentiation

  • Reverse-mode: Var<R> tracked tensors, backward() for gradient computation
  • Forward-mode: jvp(), jacobian_forward() via dual numbers
  • Second-order: hvp() for Hessian-vector products, backward_with_graph() for higher-order gradients
  • Activation checkpointing: checkpoint() to trade compute for memory
  • Backward hooks: BackwardHook trait for gradient notifications (e.g., distributed allreduce)
  • Differentiable ops: matmul, conv1d, conv2d, softmax, rms_norm, layer_norm, SiLU, softplus, SwiGLU, dropout, fused GEMM epilogue, fused add-norm, dtype cast, narrow, cat

Statistics and Probability

  • StatisticalOps: var, std, skew, kurtosis, quantile, percentile, median, cov, corrcoef
  • RandomOps: rand, randn, randint, multinomial, bernoulli, poisson, binomial, beta, gamma, exponential, chi_squared, student_t, f_distribution (with seeded deterministic generation)
  • MultivariateRandomOps: multivariate_normal, wishart, dirichlet
  • QuasirandomOps: Sobol, Halton sequences

Distance Metrics

  • DistanceOps: euclidean, manhattan, cosine, hamming, jaccard, minkowski, chebyshev, correlation
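
A usage sketch (hedged: it assumes each metric is exposed as client.<metric>(&a, &b) between two tensors; the real names, argument order, and output shapes may differ):

use numr::prelude::*;
use numr::runtime::cpu::CpuRuntime;

fn distance_demo() -> Result<()> {
    let client = CpuRuntime::client()?;
    let a = Tensor::<CpuRuntime>::from_slice(&[0.0, 1.0, 2.0], &[3])?;
    let b = Tensor::<CpuRuntime>::from_slice(&[1.0, 3.0, 5.0], &[3])?;
    let _euclid = client.euclidean(&a, &b)?; // L2 distance (assumed signature)
    let _cosine = client.cosine(&a, &b)?;    // cosine distance (assumed signature)
    Ok(())
}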

Algorithm Modules

Linear Algebra (numr::linalg):

  • Decompositions: LU, QR, Cholesky, SVD, Schur, full eigendecomposition, generalized eigenvalues
  • Solvers: solve, lstsq, pinverse
  • Matrix functions: exp, log, sqrt, sign
  • Utilities: det, trace, rank, matrix norms

Fast Fourier Transform (numr::fft):

  • FFT/IFFT (1D, 2D, ND) - Stockham algorithm
  • Real FFT (RFFT/IRFFT)

Matrix Multiplication (numr::matmul):

  • Tiled GEMM with register blocking
  • Bias fusion support

Special Functions (numr::special):

  • Error functions: erf, erfc, erfinv
  • Gamma functions: gamma, lgamma, digamma
  • Beta functions: beta, betainc
  • Incomplete gamma: gammainc, gammaincc
  • Bessel functions: J₀, J₁, Y₀, Y₁, I₀, I₁, K₀, K₁
  • Elliptic integrals: ellipk, ellipe
  • Hypergeometric functions: hyp2f1, hyp1f1
  • Airy functions: airy_ai, airy_bi
  • Legendre functions: legendre_p, legendre_p_assoc, sph_harm
  • Fresnel integrals: fresnel_s, fresnel_c
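
A usage sketch for numr::special (hedged: it assumes the special functions are exposed element-wise through the same client.op(&tensor) convention used elsewhere in this README; exact signatures may differ):

use numr::prelude::*;
use numr::runtime::cpu::CpuRuntime;

fn special_demo() -> Result<()> {
    let client = CpuRuntime::client()?;
    let x = Tensor::<CpuRuntime>::from_slice(&[0.1, 0.5, 1.0, 2.0], &[4])?;
    let _erf = client.erf(&x)?;       // error function, element-wise
    let _lgamma = client.lgamma(&x)?; // log-gamma, element-wise
    Ok(())
}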

Polynomial Operations (numr::polynomial):

  • polyroots, polyval, polyfromroots, polymul
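
For example, evaluating a polynomial at a set of points might look like this (hedged sketch; the coefficient ordering and the argument order of polyval are assumptions):

use numr::prelude::*;
use numr::runtime::cpu::CpuRuntime;

fn polynomial_demo() -> Result<()> {
    let client = CpuRuntime::client()?;
    // p(x) = 1 + 2x + 3x^2 (coefficient ordering assumed)
    let coeffs = Tensor::<CpuRuntime>::from_slice(&[1.0, 2.0, 3.0], &[3])?;
    let xs = Tensor::<CpuRuntime>::from_slice(&[0.0, 1.0, 2.0], &[3])?;
    let _values = client.polyval(&coeffs, &xs)?;
    Ok(())
}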

Iterative Solvers (numr::iterative):

  • Linear solvers: CG, MINRES, BiCGSTAB, GMRES, LGMRES, CGS, QMR, Jacobi, SOR, Adaptive GMRES
  • Eigensolvers: Lanczos (symmetric), Arnoldi/IRAM (non-symmetric)
  • Sparse SVD: via Lanczos bidiagonalization
  • Preconditioners: ILU(0), IC(0), Algebraic Multigrid (AMG) with V-cycles

Sparse Tensors (numr::sparse, feature-gated):

  • Formats: CSR, CSC, COO
  • Operations: SpGEMM (sparse matrix multiplication), SpMV (sparse matrix-vector), DSMM (dense-sparse matrix)
  • 2:4 structured sparsity with multi-backend support

Sparse Linear Algebra (numr::sparse_linalg):

  • Direct solvers: Sparse LU (Gilbert-Peierls), sparse QR
  • Incomplete factorizations: ILU(0), ILU(k), IC(0)
  • Preprocessing: COLAMD ordering, maximum transversal
  • Symbolic/numeric split: Reuse sparsity structure for repeated solves

Graph Capture (numr::runtime):

  • Graph trait: Capture a sequence of operations and replay them with zero re-launch overhead
  • CUDA Graphs: Full capture support—fixed-address buffer replay for inference loops and training steps
  • CPU / WebGPU: Transparent no-op path; callers write backend-agnostic code using R::supports_graph_capture()
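
A backend-agnostic sketch of the capture check (only R::supports_graph_capture() is taken from the description above; the actual capture/replay calls are elided because their signatures are not shown here):

use numr::prelude::*;
use numr::runtime::Runtime;
use numr::tensor::Tensor;

// On CUDA the step could be captured once and replayed; on CPU/WebGPU the
// same code simply runs eagerly through the transparent no-op path.
fn run_step<R: Runtime>(a: &Tensor<R>, b: &Tensor<R>, client: &R::Client) -> Result<Tensor<R>> {
    if R::supports_graph_capture() {
        // Capture path: record the op sequence and replay it with zero
        // re-launch overhead (see the Graph trait in numr::runtime).
    }
    client.matmul(a, b)
}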

Distributed Computing (numr::communicator, feature nccl):

  • CommunicatorGroup: Single-node multi-GPU all-reduce, broadcast, and allgather via NCCL
  • HierarchicalCommunicator: Two-level collective—NCCL intra-node, nexar inter-node
  • NexarNetCommunicator: Pure-Rust distributed transport (QUIC via nexar) for multi-machine tensor parallelism
  • BackwardHook: Autograd hook interface—trigger cross-node gradient synchronization during backward()

Dtypes

numr supports a wide range of numeric types:

| Type | Size | Feature |
| --- | --- | --- |
| f64 | 8B | - |
| f32 | 4B | - |
| f16 | 2B | f16 |
| bf16 | 2B | f16 |
| fp8e4m3 | 1B | fp8 |
| fp8e5m2 | 1B | fp8 |
| i64 | 8B | - |
| i32 | 4B | - |
| i16 | 2B | - |
| i8 | 1B | - |
| u64 | 8B | - |
| u32 | 4B | - |
| u16 | 2B | - |
| u8 | 1B | - |
| bool | 1B | - |

Every operation supports every compatible dtype. No hardcoded f32-only kernels.
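
Casting between dtypes goes through the cast op from TypeConversionOps; a sketch (hedged: the DType enum name and variant spelling are assumptions):

use numr::prelude::*;
use numr::runtime::cpu::CpuRuntime;

fn cast_demo() -> Result<()> {
    let client = CpuRuntime::client()?;
    let x = Tensor::<CpuRuntime>::from_slice(&[1.0f32, 2.0, 3.0], &[3])?;
    // f32 -> f16 (requires the `f16` feature); `DType::F16` is an assumed name
    let _half = client.cast(&x, DType::F16)?;
    Ok(())
}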

Backends

All backends implement identical algorithms with native kernels—no cuBLAS, MKL, or vendor library dependencies.

| Hardware | Backend | Feature | Status | Notes |
| --- | --- | --- | --- | --- |
| CPU (x86-64) | CPU | cpu (default) | Supported | AVX-512/AVX2 SIMD |
| CPU (ARM64) | CPU | cpu | Supported | NEON SIMD |
| NVIDIA GPU | CUDA | cuda | Supported | Native PTX kernels, caching allocator, GEMV fast paths |
| AMD GPU | WebGPU | wgpu | Supported | WGSL shaders |
| Intel GPU | WebGPU | wgpu | Supported | WGSL shaders |
| Apple GPU | WebGPU | wgpu | Supported | WGSL shaders |
| AMD GPU | ROCm | - | Planned | Native HIP kernels |

SIMD Acceleration

The CPU backend automatically detects and uses the best available SIMD instruction set at runtime:

| Architecture | Instruction Set | Vector Width | Elements per Op |
| --- | --- | --- | --- |
| x86-64 | AVX-512F + FMA | 512 bits | 16 f32 / 8 f64 |
| x86-64 | AVX2 + FMA | 256 bits | 8 f32 / 4 f64 |
| ARM64 | NEON | 128 bits | 4 f32 / 2 f64 |

Vectorized operations include:

  • Element-wise: add, sub, mul, div, neg, abs, sqrt, exp, log, sin, cos, tanh, and more
  • Reductions: sum, max, min, prod with horizontal SIMD reductions
  • Activations: sigmoid, silu, gelu, leaky_relu, elu
  • Normalization: softmax, rms_norm, layer_norm, logsumexp
  • Matrix multiplication: tiled GEMM with FMA microkernels
  • Special functions: erf, erfc, bessel, gamma (with polynomial approximations)

Why Native Kernels?

numr uses native kernels (SIMD, PTX, WGSL) by default—not cuBLAS/MKL wrappers.

|  | Vendor Libraries (cuBLAS/MKL) | numr Native Kernels |
| --- | --- | --- |
| Transparency | Black box | White box: inspect, debug, step through |
| Portability | NVIDIA-only (cuBLAS) | CPU, NVIDIA, AMD, Intel, Apple |
| Reproducibility | Heuristics change between versions | Bit-exact results, fixed per crate version |
| Choice | Locked in | Swap in vendor kernels if needed |
| Dependencies | 2GB+ CUDA toolkit | Minimal |
| Deployment | Complex linking | Simple static binaries |

You're not locked out of vendor libraries—numr's kernel system is extensible. Use native kernels for portability and transparency, or swap in cuBLAS/MKL for maximum vendor-optimized performance. Other libraries don't give you this choice.

Quick Start

CPU Example

use numr::prelude::*;
use numr::runtime::cpu::CpuRuntime;

fn main() -> Result<()> {
    // Create tensors
    let a = Tensor::<CpuRuntime>::from_slice(
        &[1.0, 2.0, 3.0, 4.0],
        &[2, 2],
    )?;
    let b = Tensor::<CpuRuntime>::from_slice(
        &[5.0, 6.0, 7.0, 8.0],
        &[2, 2],
    )?;

    // Arithmetic (with broadcasting)
    let c = a.add(&b)?;
    let d = a.mul(&b)?;

    // Matrix multiplication
    let e = a.matmul(&b)?;

    // Reductions
    let sum = c.sum()?;
    let mean = c.mean()?;
    let max = c.max()?;

    // Element-wise functions
    let exp = a.exp()?;
    let sqrt = a.sqrt()?;

    // Reshaping (zero-copy)
    let flat = c.reshape(&[4])?;
    let transposed = c.transpose()?;

    Ok(())
}

GPU Example (CUDA)

use numr::prelude::*;
use numr::runtime::cuda::CudaRuntime;

fn main() -> Result<()> {
    // Create on GPU
    let device = CudaRuntime::default_device()?;
    let a = Tensor::<CudaRuntime>::randn(&[1024, 1024], &device)?;
    let b = Tensor::<CudaRuntime>::randn(&[1024, 1024], &device)?;

    // Operations run on GPU (native CUDA kernels)
    let c = a.matmul(&b)?;

    // Transfer result to CPU when needed
    let cpu_result = c.to_cpu()?;
    let data = cpu_result.to_vec::<f32>()?;

    Ok(())
}

Backend-Generic Code

use numr::prelude::*;
use numr::runtime::Runtime;
use numr::tensor::Tensor;

// Works on CPU, CUDA, or WebGPU
fn matrix_operations<R: Runtime>(
    a: &Tensor<R>,
    b: &Tensor<R>,
    client: &R::Client,
) -> Result<Tensor<R>> {
    // Same code, any backend
    let c = client.add(a, b)?;
    let d = client.matmul(&c, a)?;
    client.sum(&d)
}

// Use the same function on different hardware
fn main() -> Result<()> {
    let a_cpu = Tensor::<CpuRuntime>::randn(&[128, 128], &device_cpu)?;
    let b_cpu = Tensor::<CpuRuntime>::randn(&[128, 128], &device_cpu)?;
    let result_cpu = matrix_operations(&a_cpu, &b_cpu, &client_cpu)?;

    #[cfg(feature = "cuda")]
    {
        let device_cuda = CudaRuntime::default_device()?;
        let a_cuda = Tensor::<CudaRuntime>::randn(&[128, 128], &device_cuda)?;
        let b_cuda = Tensor::<CudaRuntime>::randn(&[128, 128], &device_cuda)?;
        let result_cuda = matrix_operations(&a_cuda, &b_cuda, &client_cuda)?;
    }

    Ok(())
}

Linear Algebra

use numr::prelude::*;
use numr::algorithm::linalg::{LinalgOps, Decomposition};

fn main() -> Result<()> {
    // `client` and `device`: CPU client and device handles (setup elided, as in the CPU example above)
    let a = Tensor::<CpuRuntime>::randn(&[64, 64], &device)?;

    // LU decomposition
    let (p, l, u) = client.lu(&a)?;

    // QR decomposition
    let (q, r) = client.qr(&a)?;

    // SVD
    let (u, s, vt) = client.svd(&a)?;

    // Eigendecomposition
    let (eigenvalues, eigenvectors) = client.eig(&a)?;

    // Solve linear system: Ax = b
    let b = Tensor::<CpuRuntime>::randn(&[64, 32], &device)?;
    let x = client.solve(&a, &b)?;

    // Determinant, trace, rank
    let det = client.det(&a)?;
    let tr = client.trace(&a)?;
    let rank = client.matrix_rank(&a)?;

    Ok(())
}

FFT

use numr::prelude::*;
use numr::algorithm::fft::FftOps;

fn main() -> Result<()> {
    // `client` and `device`: CPU client and device handles (setup elided, as in the CPU example above)
    let x = Tensor::<CpuRuntime>::randn(&[1024], &device)?;

    // Complex FFT
    let fft_result = client.fft(&x)?;
    let inverse = client.ifft(&fft_result)?;

    // Real FFT (more efficient for real-valued inputs)
    let rfft_result = client.rfft(&x)?;
    let irfft_result = client.irfft(&rfft_result, 1024)?;

    // 2D FFT
    let image = Tensor::<CpuRuntime>::randn(&[256, 256], &device)?;
    let fft_2d = client.fft_2d(&image)?;

    Ok(())
}

Statistics and Distributions

use numr::prelude::*;

fn main() -> Result<()> {
    // `client` and `device`: CPU client and device handles (setup elided, as in the CPU example above)
    let data = Tensor::<CpuRuntime>::randn(&[1000], &device)?;

    // Descriptive statistics
    let mean = client.mean(&data)?;
    let std = client.std(&data)?;
    let var = client.var(&data)?;
    let median = client.median(&data)?;
    let q25 = client.quantile(&data, 0.25)?;

    // Statistical measures
    let skewness = client.skew(&data)?;
    let kurtosis = client.kurtosis(&data)?;

    // Covariance and correlation
    let x = Tensor::<CpuRuntime>::randn(&[100, 5], &device)?;
    let y = Tensor::<CpuRuntime>::randn(&[100, 5], &device)?;
    let cov = client.cov(&x)?;
    let corr = client.corrcoef(&x)?;

    // Random distributions
    let normal = Tensor::<CpuRuntime>::randn(&[1000], &device)?; // mean=0, std=1
    let uniform = Tensor::<CpuRuntime>::rand(&[1000], &device)?; // [0, 1)
    let gamma = client.gamma(&[1000], shape, scale, &device)?;   // `shape`, `scale`: gamma parameters (defined elsewhere)
    let poisson = client.poisson(&[1000], lambda, &device)?;     // `lambda`: Poisson rate (defined elsewhere)

    // Multivariate distributions
    let mvn = client.multivariate_normal(&[100], &mean, &cov)?;  // `mean`: mean vector tensor; `cov`: covariance from above
    let wishart = client.wishart(&[10], df, &scale_matrix)?;     // `df`: degrees of freedom; `scale_matrix`: scale tensor

    Ok(())
}

Automatic Differentiation

use numr::prelude::*;
use numr::autograd::*;

fn main() -> Result<()> {
    let client = CpuRuntime::client()?;

    // Create tracked variables
    let x = Var::new(Tensor::<CpuRuntime>::from_slice(&[2.0, 3.0], &[2])?, true);
    let w = Var::new(Tensor::<CpuRuntime>::from_slice(&[0.5, -1.0], &[2])?, true);

    // Forward pass (builds computation graph)
    let y = var_mul(&x, &w, &client)?;
    let loss = var_sum(&y, &client)?;

    // Backward pass
    let grads = backward(&loss, &client)?;
    let dx = grads.get(x.tensor());  // gradients w.r.t. x
    let dw = grads.get(w.tensor());  // gradients w.r.t. w

    // Activation checkpointing (trade compute for memory)
    let checkpointed = checkpoint(|inputs| {
        let h = var_relu(&inputs[0], &client)?;
        var_matmul(&h, &inputs[1], &client)
    }, &[&x, &w])?;

    // Forward-mode AD (Jacobian-vector products)
    let tangent = Tensor::<CpuRuntime>::ones(&[2], &device)?;  // `device`: CPU device handle (setup elided)
    let jvp_result = jvp(|x| client.mul(x, x), &x.tensor(), &tangent, &client)?;

    // Hessian-vector product
    let hvp_result = hvp(|x, c| c.mul(x, x), &x.tensor(), &tangent, &client)?;

    Ok(())
}

Installation

CPU-only (default)

[dependencies]
numr = "*"

With GPU Support

[dependencies]
# NVIDIA CUDA (requires CUDA 12.0+)
numr = { version = "*", features = ["cuda"] }

# Cross-platform GPU (NVIDIA, AMD, Intel, Apple)
numr = { version = "*", features = ["wgpu"] }

With Optional Features

[dependencies]
numr = { version = "*", features = [
    "cuda",      # NVIDIA GPU support
    "wgpu",      # Cross-platform GPU (WebGPU)
    "f16",       # Half-precision (F16, BF16)
    "sparse",    # Sparse tensors
] }

Feature Flags

| Feature | Description | Default |
| --- | --- | --- |
| cpu | CPU backend (AVX-512/AVX2 on x86-64, NEON on ARM64) | ✓ |
| cuda | NVIDIA CUDA backend | |
| wgpu | Cross-platform GPU (WebGPU) | |
| rayon | Multi-threaded CPU via Rayon | |
| f16 | Half-precision floats (F16, BF16) | |
| fp8 | FP8 precision (E4M3, E5M2) | |
| sparse | Sparse tensor support (CSR, CSC, COO) | |
| nccl | Multi-GPU communication via NCCL | |

Building from Source

# CPU only
cargo build --release

# With CUDA
cargo build --release --features cuda

# With WebGPU
cargo build --release --features wgpu

# With all features
cargo build --release --features cuda,wgpu,f16,sparse

# Run tests
cargo test --release
cargo test --release --features cuda
cargo test --release --features wgpu

# Run benchmarks
cargo bench

How numr Fits in the Stack

numr is the foundation that everything else builds on:

┌──────────────────────────────────────────────────┐
│  Your Application                                 │
│  (data science, simulation, finance, ML, etc.)   │
└─────────────────────────┬────────────────────────┘
                          │
┌─────────────────────────▼────────────────────────┐
│  solvr - Scientific Computing (like SciPy)       │
│  Optimization, ODE/PDE, interpolation, signal    │
│  https://github.com/ml-rust/solvr                │
└─────────────────────────┬────────────────────────┘
                          │
┌─────────────────────────▼────────────────────────┐
│  numr - Foundations (like NumPy)  ◄── YOU ARE HERE│
│  Tensors, linalg, FFT, statistics, random        │
│  Native CPU, CUDA, WebGPU kernels + autograd     │
└──────────────────────────────────────────────────┘

numr : solvr :: NumPy : SciPy

When numr's kernels improve, everything above improves automatically.

Kernels and Extensibility

numr provides default kernels for all operations, and the kernel system is extensible:

  • Use default kernels: All operations work out of the box with optimized kernels:
    • CPU: SIMD-vectorized kernels (AVX-512/AVX2 on x86-64, NEON on ARM64)
    • CUDA: Native PTX kernels (compiled at build time, loaded on first use)
    • WebGPU: WGSL compute shaders for cross-platform GPU
  • Replace specific kernels: Swap in your own optimized kernels for performance-critical paths
  • Add new operations: Define new traits and implement kernels for all backends

For detailed guidance on writing custom kernels, adding new operations, and backend-specific optimization techniques, see docs/extending-numr.md.

License

Apache-2.0
