Overview
0.5.0 is a major feature and hardening release: 131 commits, 875 files changed, +85k/-28k lines. It delivers fused GPU kernels, FP8/quantized compute, structured sparsity, and significantly expanded autograd coverage, while also refactoring and deduplicating extensively.
Downstream integration validated: solvr and boostr tested against numr 0.5.0 — this release unblocks publishing updated versions of both.
Closed via #6.
Completed
Fused Operations
- Fused GEMM epilogue: matmul+bias+activation in a single kernel (forward + backward)
- Fused activation-mul for gated architectures (SwiGLU, SiLU-mul)
- Fused add-norm: residual add + normalization in one pass (forward + backward)
- Fused elementwise operation chains across all backends
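The fused activation-mul can be illustrated with a minimal scalar sketch (names like `fused_silu_mul` are illustrative, not numr's API): the gate activation and the elementwise multiply happen in one traversal, so no intermediate `silu(gate)` tensor is ever materialized.

```rust
/// SiLU (a.k.a. swish): x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Fused SiLU-mul as used in gated architectures (SwiGLU):
/// out[i] = silu(gate[i]) * up[i], computed in a single pass.
fn fused_silu_mul(gate: &[f32], up: &[f32], out: &mut [f32]) {
    for ((o, &g), &u) in out.iter_mut().zip(gate).zip(up) {
        *o = silu(g) * u; // one traversal, no intermediate buffer
    }
}

fn main() {
    let gate = [0.0_f32, 1.0, -1.0];
    let up = [2.0_f32, 2.0, 2.0];
    let mut out = [0.0_f32; 3];
    fused_silu_mul(&gate, &up, &mut out);
    println!("{:?}", out); // silu(0) = 0, so out[0] == 0.0
}
```

On a GPU the same fusion saves a kernel launch and a round trip through global memory for the intermediate.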
FP8 & Quantized Compute
- FP8 (E4M3/E5M2) matmul across all backends
- FP8 kernel support across CUDA compute paths
- i8×i8→i32 quantized matrix multiplication (CPU)
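The accumulation-width choice behind i8×i8→i32 can be sketched as follows (a naive reference loop, not the optimized CPU kernel): each i8×i8 product fits in i16, but summing K of them requires a 32-bit accumulator.

```rust
/// Naive i8 x i8 -> i32 matmul: A is m x k, B is k x n, row-major.
/// Widening to i32 before the multiply-accumulate avoids overflow.
fn matmul_i8(a: &[i8], b: &[i8], m: usize, k: usize, n: usize) -> Vec<i32> {
    let mut c = vec![0i32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0i32;
            for p in 0..k {
                acc += a[i * k + p] as i32 * b[p * n + j] as i32;
            }
            c[i * n + j] = acc;
        }
    }
    c
}

fn main() {
    // 1x2 * 2x1: 127*127 + (-128)*127 = 16129 - 16256 = -127
    let c = matmul_i8(&[127, -128], &[127, 127], 1, 2, 1);
    println!("{:?}", c);
}
```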
Sparse
- 2:4 structured sparsity with multi-backend support
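The 2:4 pattern constrains every contiguous group of four weights to have at most two nonzeros, which hardware sparse tensor cores can exploit. A minimal magnitude-pruning sketch (illustrative only; `prune_2_4` is not numr's API, and it assumes the length is a multiple of 4):

```rust
/// Prune to a 2:4 structured-sparse pattern: in each group of four
/// weights, keep the two with the largest magnitude, zero the rest.
fn prune_2_4(w: &mut [f32]) {
    for group in w.chunks_exact_mut(4) {
        let mut idx = [0usize, 1, 2, 3];
        // sort indices by descending |weight|
        idx.sort_by(|&a, &b| group[b].abs().partial_cmp(&group[a].abs()).unwrap());
        for &i in &idx[2..] {
            group[i] = 0.0; // drop the two smallest-magnitude weights
        }
    }
}

fn main() {
    let mut w = [0.1_f32, -0.9, 0.5, 0.2];
    prune_2_4(&mut w);
    println!("{:?}", w); // keeps -0.9 and 0.5
}
```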
Autograd Expansion
- Differentiable conv1d, conv2d, softmax, rms_norm, layer_norm, SiLU, softplus, SwiGLU, dropout, fused GEMM epilogue, fused add-norm, dtype cast, narrow, cat, gather
- Activation checkpointing
- Backward hooks for distributed gradient sync
Performance
- CUDA caching allocator (replaces stream-ordered alloc)
- CUDA pipelined D2H copy for concurrent execution
- GEMV-BT fast paths across CPU/CUDA/WebGPU
- Online softmax in SIMD kernels
- Welford algorithm for numerically stable variance
- AVX2 transcendental/special function SIMD kernels
- Tiled GEMM with dual-accumulator FMA microkernels (AVX2/AVX-512/NEON)
- Half-precision GEMV-BT acceleration (f16/bf16)
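Welford's algorithm, mentioned above for variance, is worth a short sketch: it updates the mean and the sum of squared deviations in one pass, avoiding the catastrophic cancellation of the textbook E[x²] − E[x]² formula.

```rust
/// Welford's online algorithm: single pass, numerically stable.
/// Returns (mean, population variance).
fn welford_variance(xs: &[f32]) -> (f32, f32) {
    let (mut mean, mut m2) = (0.0_f32, 0.0_f32);
    for (i, &x) in xs.iter().enumerate() {
        let delta = x - mean;
        mean += delta / (i as f32 + 1.0);
        m2 += delta * (x - mean); // uses the *updated* mean
    }
    (mean, m2 / xs.len() as f32)
}

fn main() {
    let (mean, var) = welford_variance(&[1.0, 2.0, 3.0, 4.0]);
    println!("mean = {mean}, var = {var}"); // 2.5 and 1.25
}
```

The same update structure is what makes the algorithm suitable for normalization kernels, where mean and variance are needed in one sweep over a row.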
Runtime & Infrastructure
- CUDA graph capture support
- NCCL communicator for multi-GPU collectives
- Nexar inter-node communicator
- Seeded deterministic RNG across all backends
- Internal RNG (removed external rand/rand_distr dependency)
- Slice assign operation across all backends
- Streaming sync ops for compute-communication overlap
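A dependency-free seeded RNG of the kind the internal-RNG item describes can be as small as SplitMix64 (shown here as a generic sketch; the release notes do not say which generator numr actually uses): the same seed reproduces the same stream on every backend.

```rust
/// SplitMix64: a tiny, deterministic, seedable PRNG.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        // advance by the golden-ratio increment, then mix
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

fn main() {
    let (mut a, mut b) = (SplitMix64::new(42), SplitMix64::new(42));
    // identical seeds yield identical sequences
    assert_eq!(a.next_u64(), b.next_u64());
    println!("deterministic: ok");
}
```

Because the mixing function is a bijection of the state, distinct seeds are also guaranteed to produce distinct first outputs.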
Architecture
- Runtime::DType associated type
- CPU backend made unconditional
- Backward pass accumulation in precision-appropriate float type
- Static WGSL shaders replacing runtime generation
Code Organization (completed splits)
- Autograd reduce ops — split by operation
- CPU AVX2 math kernels — split by function category
- CUDA sparse merge kernels — split by strategy
- CUDA index kernel launchers — split into modules
Downstream Integration
- Test solvr against numr 0.5.0
- Test boostr against numr 0.5.0
- Resolve build failures in downstream crates
Fixes
- aarch64 NEON: replaced non-existent `vmvnq_u64` with a correct bitwise NOT
- Softmax NaN prevention for -inf inputs
- Contiguity check for size-1 dim strides
- CUDA graph capture allocator freeze/unfreeze
- Batched matmul broadcasting across all backends
- F16/BF16 backward pass numerical stability (f32 accumulation)
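The softmax -inf fix guards the classic failure mode of max-subtracted softmax: when a row is entirely -inf (e.g. fully masked attention), `x - max` becomes `-inf - (-inf) = NaN`. A reference-level sketch of the guard (illustrative, not numr's kernel):

```rust
/// Max-subtracted softmax, guarded against all -inf inputs:
/// a fully masked row maps to all zeros instead of NaNs.
fn softmax(xs: &[f32]) -> Vec<f32> {
    let max = xs.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    if max == f32::NEG_INFINITY {
        return vec![0.0; xs.len()]; // every input masked out
    }
    let exps: Vec<f32> = xs.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    let out = softmax(&[f32::NEG_INFINITY, f32::NEG_INFINITY]);
    assert!(out.iter().all(|v| v.is_finite()));
    println!("{:?}", out);
}
```

Partially masked rows need no special case: once `max` is finite, `exp(-inf - max)` is exactly 0.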
Deferred to 0.6.0
Tracked separately:
- Error handling cleanup (~1,400 unwraps)
- Remaining oversized file splits (22 files)
- Migration guide (ndarray/PyTorch)
- API stability audit
- Second-order derivative fragility fix
- Remaining autograd ops (complex, scatter, index_select)
- CI hardening (cargo audit, coverage metrics)