
0.5.0: Fused ops, FP8 compute, 2:4 sparsity, autograd expansion, production hardening #5

@farhan-syah


Overview

0.5.0 is a major feature-and-hardening release: 131 commits, 875 files changed, +85k/-28k lines. It delivers fused GPU kernels, FP8/quantized compute, structured sparsity, and significantly expanded autograd coverage, alongside extensive refactoring and deduplication.

Downstream integration validated: solvr and boostr were tested against numr 0.5.0; this release unblocks publishing updated versions of both.

Closed via #6.


Completed

Fused Operations

  • Fused GEMM epilogue: matmul+bias+activation in a single kernel (forward + backward)
  • Fused activation-mul for gated architectures (SwiGLU, SiLU-mul)
  • Fused add-norm: residual add + normalization in one pass (forward + backward)
  • Fused elementwise operation chains across all backends
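The idea behind the fused GEMM epilogue can be shown with a minimal CPU sketch: the bias add and activation happen inside the same loop nest as the matmul, while the accumulator is still in registers, instead of as separate kernel launches. Function and parameter names here are illustrative, not numr's API.

```rust
// Naive fused matmul + bias + ReLU over row-major slices.
// a is m x k, b is k x n, bias has length n.
fn fused_gemm_bias_relu(
    a: &[f32],
    b: &[f32],
    bias: &[f32],
    m: usize,
    k: usize,
    n: usize,
) -> Vec<f32> {
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            // Epilogue fused into the same pass: bias add + activation.
            out[i * n + j] = (acc + bias[j]).max(0.0);
        }
    }
    out
}

fn main() {
    // [1, 2] @ [[1, -1], [1, -1]] + [0, 0], then ReLU.
    let out = fused_gemm_bias_relu(&[1.0, 2.0], &[1.0, -1.0, 1.0, -1.0], &[0.0, 0.0], 1, 2, 2);
    assert_eq!(out, vec![3.0, 0.0]);
}
```

The real kernels additionally fuse the backward pass; the point of fusion is avoiding a round trip of the m x n output through memory between matmul, bias, and activation.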

FP8 & Quantized Compute

  • FP8 (E4M3/E5M2) matmul across all backends
  • FP8 kernel support across CUDA compute paths
  • i8×i8→i32 quantized matrix multiplication (CPU)
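A reference version of i8×i8→i32 matmul, to show why the accumulator is widened: the product of two i8 values can reach 16384, so partial sums are held in i32 where they cannot overflow for realistic k. This is a sketch, not numr's actual signature.

```rust
// Quantized matmul over row-major i8 slices, accumulating in i32.
// a is m x k, b is k x n.
fn matmul_i8_i32(a: &[i8], b: &[i8], m: usize, k: usize, n: usize) -> Vec<i32> {
    let mut out = vec![0i32; m * n];
    for i in 0..m {
        for p in 0..k {
            let av = a[i * k + p] as i32; // widen once per row element
            for j in 0..n {
                out[i * n + j] += av * b[p * n + j] as i32;
            }
        }
    }
    out
}

fn main() {
    // Worst-case products (-128 * -128 = 16384) stay exact in i32.
    let out = matmul_i8_i32(&[-128, -128], &[-128, -128], 1, 2, 1);
    assert_eq!(out, vec![32768]);
}
```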

Sparse

  • 2:4 structured sparsity with multi-backend support
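2:4 structured sparsity means every contiguous group of four weights keeps at most two nonzeros, which hardware sparse-matmul units can exploit. A reference pruner (keep the two largest magnitudes per group, zero the rest) might look like this; the function name is hypothetical and numr's kernels operate on a compressed representation rather than zeroing in place.

```rust
// Prune a weight slice to the 2:4 pattern in place: in each full group of
// four, zero the two smallest-magnitude entries.
fn prune_2_4(w: &mut [f32]) {
    for group in w.chunks_exact_mut(4) {
        let mut idx = [0usize, 1, 2, 3];
        // Sort indices by descending |weight|.
        idx.sort_by(|&a, &b| group[b].abs().partial_cmp(&group[a].abs()).unwrap());
        for &i in &idx[2..] {
            group[i] = 0.0;
        }
    }
}

fn main() {
    let mut w = [0.1, -2.0, 0.3, 1.5];
    prune_2_4(&mut w);
    // The two largest magnitudes (-2.0 and 1.5) survive.
    assert_eq!(w, [0.0, -2.0, 0.0, 1.5]);
}
```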

Autograd Expansion

  • Differentiable conv1d, conv2d, softmax, rms_norm, layer_norm, SiLU, softplus, SwiGLU, dropout, fused GEMM epilogue, fused add-norm, dtype cast, narrow, cat, gather
  • Activation checkpointing
  • Backward hooks for distributed gradient sync

Performance

  • CUDA caching allocator (replaces stream-ordered alloc)
  • CUDA pipelined D2H copy for concurrent execution
  • GEMV-BT fast paths across CPU/CUDA/WebGPU
  • Online softmax in SIMD kernels
  • Welford algorithm for numerically stable variance
  • AVX2 transcendental/special function SIMD kernels
  • Tiled GEMM with dual-accumulator FMA microkernels (AVX2/AVX-512/NEON)
  • Half-precision GEMV-BT acceleration (f16/bf16)
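The Welford item refers to the standard single-pass mean/variance recurrence, which avoids the catastrophic cancellation of the naive sum-of-squares formula. A minimal sketch (numr's SIMD kernels vectorize the same recurrence):

```rust
// Welford's online algorithm: returns (mean, population variance) in one pass.
fn welford_variance(xs: &[f64]) -> (f64, f64) {
    let (mut mean, mut m2) = (0.0f64, 0.0f64);
    for (i, &x) in xs.iter().enumerate() {
        let delta = x - mean;
        mean += delta / (i + 1) as f64;
        m2 += delta * (x - mean); // uses the *updated* mean
    }
    (mean, m2 / xs.len() as f64)
}

fn main() {
    let (mean, var) = welford_variance(&[2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]);
    assert!((mean - 5.0).abs() < 1e-12);
    assert!((var - 4.0).abs() < 1e-12);
}
```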

Runtime & Infrastructure

  • CUDA graph capture support
  • NCCL communicator for multi-GPU collectives
  • Nexar inter-node communicator
  • Seeded deterministic RNG across all backends
  • Internal RNG (removed external rand/rand_distr dependency)
  • Slice assign operation across all backends
  • Streaming sync ops for compute-communication overlap
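A seeded deterministic RNG of the kind a numerics library might ship after dropping rand/rand_distr can be very small; SplitMix64 is one common choice. This is a sketch of the pattern, not necessarily the generator numr uses.

```rust
// SplitMix64: a tiny, fast, seedable PRNG with a 64-bit state.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

fn main() {
    // Same seed -> identical stream, the property "seeded deterministic RNG
    // across all backends" guarantees.
    let (mut a, mut b) = (SplitMix64(42), SplitMix64(42));
    let xs: Vec<u64> = (0..4).map(|_| a.next_u64()).collect();
    let ys: Vec<u64> = (0..4).map(|_| b.next_u64()).collect();
    assert_eq!(xs, ys);
}
```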

Architecture

  • Runtime::DType associated type
  • CPU backend made unconditional
  • Backward pass accumulation in precision-appropriate float type
  • Static WGSL shaders replacing runtime generation
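Why backward accumulation needs a "precision-appropriate" float type: summing many small low-precision terms in that same precision drifts, while a wider accumulator stays close to the exact sum. The sketch below uses f32 terms with an f64 accumulator as a stand-in for the f16/bf16-with-f32 case (stable Rust has no native f16).

```rust
// Accumulate n copies of `term`, once in f32 and once in f64, and return
// each accumulator's absolute error against the exact sum.
fn accumulation_errors(n: u32, term: f32) -> (f64, f64) {
    let mut acc_f32 = 0.0f32;
    let mut acc_f64 = 0.0f64;
    for _ in 0..n {
        acc_f32 += term;
        acc_f64 += term as f64;
    }
    let exact = n as f64 * term as f64;
    ((acc_f32 as f64 - exact).abs(), (acc_f64 - exact).abs())
}

fn main() {
    let (err_narrow, err_wide) = accumulation_errors(10_000_000, 0.1);
    // The wide accumulator is strictly more accurate over long reductions.
    assert!(err_wide < err_narrow);
}
```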

Code Organization (completed splits)

  • Autograd reduce ops — split by operation
  • CPU AVX2 math kernels — split by function category
  • CUDA sparse merge kernels — split by strategy
  • CUDA index kernel launchers — split into modules

Downstream Integration

  • Test solvr against numr 0.5.0
  • Test boostr against numr 0.5.0
  • Resolve build failures in downstream crates

Fixes

  • aarch64 NEON: replaced non-existent vmvnq_u64 with correct bitwise NOT
  • Softmax NaN prevention for -inf inputs
  • Contiguity check for size-1 dim strides
  • CUDA graph capture allocator freeze/unfreeze
  • Batched matmul broadcasting across all backends
  • F16/BF16 backward pass numerical stability (f32 accumulation)
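The softmax -inf fix relies on the standard max-subtraction trick: once the row maximum is subtracted, a masked -inf entry yields exp(-inf) = 0 instead of propagating NaN through the normalization. A reference sketch (an all--inf row would still need a separate 0/0 guard, which this minimal version omits):

```rust
// Numerically stable softmax via max subtraction.
fn softmax(xs: &[f32]) -> Vec<f32> {
    let max = xs.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = xs.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

fn main() {
    // A masked position (-inf) gets probability 0; nothing is NaN.
    let p = softmax(&[f32::NEG_INFINITY, 0.0]);
    assert_eq!(p, vec![0.0, 1.0]);
    assert!(p.iter().all(|v| v.is_finite()));
}
```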

Deferred to 0.6.0

Tracked separately:

  • Error handling cleanup (~1,400 unwraps)
  • Remaining oversized file splits (22 files)
  • Migration guide (ndarray/PyTorch)
  • API stability audit
  • Second-order derivative fragility fix
  • Remaining autograd ops (complex, scatter, index_select)
  • CI hardening (cargo audit, coverage metrics)
