v0.5.0: fused ops, FP8 compute, 2:4 sparsity, autograd expansion#6
Merged
farhan-syah merged 132 commits into main, Mar 14, 2026
Conversation
Add Hash trait to Layout, Shape, and Strides to enable their use in hash-based collections. Fix the contiguity check to correctly identify strided views that maintain row-major order regardless of offset.

Extend the Layout API with methods for common tensor operations:
- Transpose operations (t, transpose_axes)
- Dimension manipulation (squeeze_dim, squeeze_all, unsqueeze_at)
- Flattening and permutation (flatten, permute_dims)
- Advanced indexing (as_strided, index_to_offset, offset_to_index)
- Broadcasting utilities (broadcast_shape, broadcast_shapes)
- Storage calculations (storage_size)

Add From trait implementations for ergonomic layout construction from tuples, arrays, and slices up to 6 dimensions.
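As a rough illustration of what `broadcast_shapes` presumably computes, here is a sketch of the standard right-aligned broadcast rule (the function body is an assumption, not numr's actual implementation):

```rust
// Hypothetical sketch of the NumPy-style broadcast rule: align shapes
// from the right; each dimension pair must be equal or contain a 1.
fn broadcast_shapes(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let n = a.len().max(b.len());
    let mut out = vec![0usize; n];
    for i in 0..n {
        // Dimensions beyond a shape's rank are treated as 1.
        let da = if i < n - a.len() { 1 } else { a[i - (n - a.len())] };
        let db = if i < n - b.len() { 1 } else { b[i - (n - b.len())] };
        out[i] = match (da, db) {
            (x, y) if x == y => x,
            (1, y) => y,
            (x, 1) => x,
            _ => return None, // incompatible shapes
        };
    }
    Some(out)
}
```

For example, broadcasting `[3, 1]` against `[2, 1, 4]` yields `[2, 3, 4]`, while `[3]` against `[4]` fails.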
Increment minor version to reflect new tensor layout API features including Hash trait implementations and comprehensive dimension manipulation methods.
Split monolithic dtype/mod.rs into focused modules for better maintainability and extensibility. Introduces DataType trait to enable downstream libraries like boostr to define custom dtype enums with quantized variants while maintaining compatibility with numr's core tensor operations.
Enables runtimes to specify their dtype enum through an associated type, allowing downstream libraries to extend numr with custom quantized types while maintaining type safety and backend compatibility.
Updates Tensor to use Runtime's associated DType instead of hardcoded numr::DType, enabling extensibility for downstream libraries. Reorganizes tensor factory methods to separate generic DataType operations from concrete DType-specific constructors, improving code organization and reducing duplication.
Propagates Runtime<DType = DType> bounds throughout operation traits, implementation helpers, and shape utilities to support the new extensible dtype system while maintaining backward compatibility.
Propagates dtype trait bounds through linear algebra and polynomial algorithms, maintaining consistency with the new extensible type system for tensor decomposition, polynomial operations, and FFT-based convolutions.
Propagates dtype trait bounds through gradient computation and variable operations, ensuring type safety in automatic differentiation with the extensible dtype system.
Updates test utilities and backend parity checks to work with the new DataType trait, ensuring comprehensive validation across CPU, CUDA, and WebGPU backends with the extensible dtype architecture.
Introduce AllocationStats for profiling allocator behavior and TrackingAllocator<A>, a generic wrapper that layers thread-safe tracking on top of any Allocator implementation. TrackingAllocator records total allocations, total bytes, active allocation count, peak memory usage (high-water mark), and frozen state. Cloning shares the same Arc<Mutex<...>> state so that all handles observe the same counters.

Two new error variants support the allocator lifecycle:
- AllocatorBusy: reset rejected while live allocations exist
- AllocatorFrozen: new allocations rejected while frozen

The Allocator trait gains two defaulted methods:
- stats() -> AllocationStats (zeroed default for non-tracking impls)
- reset() -> Result<()> (no-op default)

Tests cover: basic stat tracking, allocated_bytes(), freeze/unfreeze, reset success, reset-while-busy rejection, peak across cycles, clone state sharing, and freeze preservation through reset.
Add a set of commonly needed methods to Tensor<R> that reduce boilerplate in downstream code.

Ergonomic aliases for existing accessors:
- rank() -> ndim() alias
- elem_count() -> numel() alias
- dims() -> shape() alias returning &[usize]
- len() -> numel() alias for Iterator/slice parity
- is_empty() -> true when numel() == 0

Typed dimension access:
- dim(index: isize) -> Result<usize>, negative-index aware
- dims1() through dims5() unpack shape into typed tuples, returning ShapeMismatch when the rank does not match

Low-level storage inspection:
- offset() -> layout offset in elements
- ptr() -> raw base storage pointer
- data_ptr() -> ptr + offset * dtype_size (first element)
- owns_memory() -> whether storage deallocates on drop
- shares_storage_with() -> true when two tensors share a buffer
- ref_count() -> storage Arc reference count

Construction helper:
- from_storage_contiguous(storage, shape) builds a Tensor directly from a Storage handle without going through a client

Deep copy:
- to_bytes() -> materializes tensor data as raw bytes (contiguous first)
- clone_deep() -> full copy with independent storage
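The negative-index-aware `dim()` accessor likely follows the usual Python-style convention; a minimal free-function sketch (not numr's actual method, which returns a Result on the Tensor type):

```rust
// Hypothetical sketch: a negative index counts from the end of the
// shape, so -1 is the last dimension. Out-of-range indices yield None.
fn dim(shape: &[usize], index: isize) -> Option<usize> {
    let rank = shape.len() as isize;
    let i = if index < 0 { index + rank } else { index };
    if i < 0 || i >= rank {
        None
    } else {
        Some(shape[i as usize])
    }
}
```

So for a shape of `[2, 3, 4]`, `dim(-1)` is 4 and `dim(3)` is out of range.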
Introduce a Graph trait for capturing and replaying computation sequences, backed by CUDA Graphs on the CUDA runtime and a no-op eager path on CPU and WebGPU.

- Add Graph trait with launch() and is_replay_capable() to src/runtime/graph.rs
- Add NoOpGraph for CPU and WebGPU (operations execute eagerly during capture)
- Add CudaGraph wrapping cudarc's CudaGraph behind Arc<Mutex<>> for Send + Sync
- Add Runtime::Graph as a new associated type on the Runtime trait
- Add Runtime::capture_graph() as a required method replacing the stub
- Implement capture_graph() on CpuRuntime (eager), WgpuRuntime (eager), and CudaRuntime (real stream capture via cudarc begin_capture/end_capture)
- The CUDA implementation ends capture even when the closure fails, so the stream is never left in capture mode
- Add unit tests for CPU eager execution, error propagation, and NoOpGraph
- Update MockRuntime in external_backend_api.rs to satisfy the new trait bound
Replace bare R: Runtime bounds with R: Runtime<DType = DType> in all sites that work directly with DType values. This eliminates implicit assumptions about the associated type and makes each function's requirements explicit to the type checker.

Affected sites:
- fallback.rs: validate_binary_dtypes, compute_broadcast_shape, all fallback op helpers (binary, unary, scalar, reduce, activation, softmax, matmul, compare, where_cond, csc/coo elementwise)
- statistics_common.rs: skew_composite, kurtosis_composite
- impl_generic/linalg.rs: triangular_mask_impl, triu_impl, tril_impl, slogdet_impl
- impl_generic/utility.rs: one_hot_impl

Also remove an unconditional TypeConversionOps import in cuda/random.rs that is only needed under the fp8 feature flag, and drop an unused TypeConversionOps import in cuda/linalg/statistics.rs.
Adds integration tests verifying F16, BF16, FP8E4M3, and FP8E5M2 support across all ML-critical CPU operations: binary, scalar, unary, reduce, matmul, activations, and normalizations. Each dtype is audited end-to-end including round-trip casts from F32, with per-operation pass/fail reporting and a summary assertion to catch regressions in reduced-precision coverage.
Introduces a runtime-level abstraction for collective and point-to-point communication across devices, supporting distributed FFT, parallel linear algebra, Monte Carlo simulations, and gradient synchronization.

- `Communicator` trait with allreduce, broadcast, allgather, reducescatter, and point-to-point send/recv operations over raw device pointers
- `ReduceOp` enum covering Sum, Prod, Min, Max reductions
- `NoOpCommunicator` for single-device operation (world_size=1): in-place collectives are true no-ops, separate-buffer collectives perform a memcpy, point-to-point ops are no-ops
- Re-export `Communicator`, `NoOpCommunicator`, and `ReduceOp` from the `runtime` public API
Replace direct `.unwrap()` on Mutex::lock() calls with a private `lock()` helper that recovers from a poisoned lock via `into_inner()`. A poisoned lock means another thread panicked while holding it; the tracking counters may be inconsistent but the inner allocator remains usable, making recovery safer than propagating a panic to the caller.
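The recovery pattern described above can be sketched as a small free function (the real helper is private to TrackingAllocator; this standalone version is an assumption about its shape):

```rust
use std::sync::{Mutex, MutexGuard};

// A poisoned lock only means another thread panicked while holding it;
// the protected data is still accessible via into_inner(), so recover
// instead of propagating the panic to the caller.
fn lock<T>(m: &Mutex<T>) -> MutexGuard<'_, T> {
    m.lock().unwrap_or_else(|poisoned| poisoned.into_inner())
}
```

A thread that panics while holding the guard poisons the mutex, but subsequent callers of `lock()` still get a usable guard rather than a panic.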
All top-level items in ml_dtype_audit.rs are now guarded with #[cfg(any(feature = "f16", feature = "fp8"))] so the test file compiles cleanly without those optional features enabled.
Add `as_host_slice` and `as_host_slice_mut` unsafe methods to `Storage<R>` that return borrowed slices into CPU-backed memory without allocating. Both methods short-circuit on empty storage and document the safety invariants required of callers (valid host pointer, no aliasing for the mutable variant).
Implement NarrowBackward and CatBackward gradient functions, enabling autograd to propagate gradients through tensor slicing and concatenation. NarrowBackward pads the incoming gradient with zeros to restore the original shape along the narrowed dimension. CatBackward splits the output gradient back into per-input slices using narrow, reversing the concatenation exactly. Export var_narrow and var_cat from the autograd crate root alongside the existing shape op exports.
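A one-dimensional model of the NarrowBackward zero-padding step (purely illustrative; the real gradient function operates on tensors along an arbitrary dimension):

```rust
// The incoming gradient covers only the narrowed slice; pad it back to
// the original length with zeros, since positions outside the slice
// did not contribute to the output and receive zero gradient.
fn narrow_backward_1d(grad: &[f32], start: usize, full_len: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; full_len];
    out[start..start + grad.len()].copy_from_slice(grad);
    out
}
```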
In CudaGraph::launch, recover from a poisoned mutex rather than panicking, consistent with the existing TrackingAllocator fix. In Storage::as_host_slice_mut, change the receiver from &self to &mut self so the mutable slice borrow is sound — a mutable slice must come from exclusive access to the backing storage.
Add slice_assign to IndexingOps, which copies a source tensor into a contiguous slice of a destination tensor along a given dimension starting at a specified index, returning a new tensor with the region replaced.

Implemented natively on all three backends:
- CPU: pointer-based kernel that copies dst then overwrites the slice region with src using dispatch_dtype
- CUDA: PTX kernel instantiated for all supported dtypes (f32, f64, f16, bf16, i32, i64, fp8_e4m3, fp8_e5m2) via the existing launch_slice_assign launcher
- WebGPU: WGSL compute shader generated per dtype (f32, i32, u32) with a SliceAssignParams uniform; get_buffer is widened to pub to support the bind group wiring

Expose the operation on Tensor<R> via Tensor::slice_assign for ergonomic use at the call site.
Implement NcclCommunicator wrapping cudarc's nccl::Comm to satisfy the Communicator trait for CUDA multi-GPU workloads. Supports all_reduce, broadcast, all_gather, reduce_scatter, send, recv, sync, and barrier. DType dispatch is handled via raw nccl::result FFI to avoid compile-time NcclType generic constraints, covering F32, F64, F16, BF16, FP8E4M3, FP8E5M2, I32, I64, I8, U32, and U8. A new nccl feature flag chains the cuda feature and cudarc's nccl feature behind a single opt-in gate. NcclCommunicator is re-exported from the runtime crate root when the flag is active.
Implement var_rms_norm and var_layer_norm with full gradient support for the autograd system. Both operations use the fused NormalizationOps kernel for the forward pass and compute numerically stable gradients in the backward pass. RMS norm gradients account for the interaction between input and weight via the rstd and x_norm tensors recomputed from saved inputs. Layer norm gradients additionally handle the bias term and subtract the mean of the scaled gradient to satisfy the zero-sum constraint over the normalized dimension. Both var_backward and backward_var paths are implemented, enabling higher-order gradient computation through normalization layers.
Introduce NexarNetCommunicator, which implements the Communicator trait using nexar::SyncClient as the transport layer. This enables inter-node collective operations (allreduce, broadcast, all_gather, reduce_scatter, send, recv, barrier) over QUIC without requiring NCCL or any GPU-specific infrastructure. The implementation is gated behind the nexar feature flag and is intended for CPU-to-CPU inter-node gradient synchronization and tensor parallelism. For intra-node GPU-GPU traffic, NcclCommunicator remains the right choice given NVLink and PCIe bandwidth advantages. DType and ReduceOp mappings cover F32, F64, F16, BF16, integer types, and reject unsupported types with a clear error.
Previously the Tensor API had two pointer accessors: ptr() which returned the raw base storage address, and data_ptr() which returned the offset-adjusted pointer to the first element of the tensor view. This caused widespread confusion where call sites used storage().ptr() instead of data_ptr() and therefore silently operated on the wrong memory address for non-zero-offset views (slices, transposes). Remove data_ptr() and redefine ptr() to always return the offset-adjusted pointer. Update all call sites across ops, runtime helpers, kernels, and sparse operations to use the unified ptr() accessor.
Add two composite activation operations following the impl_generic pattern:

log_softmax: computed as x - logsumexp(x, dim) for numerical stability. Implemented in impl_generic/activation.rs and delegated by all three backends (CPU, CUDA, wgpu). Includes a LogSoftmaxBackward grad function in the autograd system and var_log_softmax for traced computation.

dropout: randomly zeros elements with probability p during training and scales the remaining elements by 1/(1-p). Returns the input unchanged during inference. Implemented in impl_generic and delegated by all backends.

Both operations are exposed via Tensor convenience methods (log_softmax, dropout) and tested with unit tests covering standard cases, edge cases (p=0, p=1), and gradient correctness.
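The x - logsumexp(x) formulation can be sketched in scalar form (a minimal f64 illustration, not the tensor implementation):

```rust
// Stable log_softmax: subtract the max before exponentiating so exp()
// never overflows, then log_softmax(x) = x - logsumexp(x).
fn log_softmax(x: &[f64]) -> Vec<f64> {
    let m = x.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let lse = m + x.iter().map(|v| (v - m).exp()).sum::<f64>().ln();
    x.iter().map(|v| v - lse).collect()
}
```

Even for inputs like `[1000.0, 1000.0]`, where a naive `exp(x)` would overflow to infinity, this returns the exact `-ln(2)` for each element.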
The iterative solver helpers (vector_norm, vector_dot, update_solution, accumulate_basis_combination, extract_diagonal_inv) and their callers in all GMRES variants, CG, BiCGSTAB, CGS, QMR, MINRES, Lanczos, Arnoldi, Jacobi, SOR, SVDS, AMG, and sparse LU decompositions were using an unconstrained R: Runtime bound. These functions extract scalar values via item() which requires the runtime to use the standard DType. Tighten the bound to R: Runtime<DType = DType> to make this requirement explicit and prevent misuse with non-standard runtime type parameters.
Implement var_silu and SiluBackward for differentiable SiLU (Swish) support in the autograd system. The gradient uses the numerically stable form: sigmoid(x) * (1 + x - silu(x)), avoiding a redundant sigmoid computation by reusing the saved forward output. Also promote ActivationOps from a test-only import to a full import in the activation backward module, since SiluBackward requires it unconditionally.
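The gradient identity used above follows from the product rule: d/dx [x·σ(x)] = σ(x) + x·σ(x)(1-σ(x)) = σ(x)(1 + x - silu(x)). A scalar check of that identity against a finite difference (illustrative only):

```rust
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

fn silu(x: f64) -> f64 {
    x * sigmoid(x)
}

// The numerically stable gradient form from the commit message:
// reuses the forward output silu(x) instead of a second sigmoid.
fn silu_grad(x: f64) -> f64 {
    sigmoid(x) * (1.0 + x - silu(x))
}
```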
Implement softplus — log(1 + exp(x)) — across the full stack:
- `ActivationOps::softplus` trait method with a default NotImplemented body
- `softplus_impl` in impl_generic using the numerically stable form `relu(x) + log(1 + exp(-|x|))` to avoid overflow for large positive inputs
- CPU, CUDA, and WebGPU backends delegate to softplus_impl
- `var_softplus` autograd op with `SoftplusBackward` gradient node; backward computes sigmoid(x), which is the exact derivative
- Tests covering zero, non-zero, large positive/negative, batched input, and non-unit upstream gradients
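The stable identity can be verified in scalar form (an illustration of the formula, not numr's kernel): since log(1 + exp(x)) = max(x, 0) + log(1 + exp(-|x|)), the exp() argument is always non-positive and never overflows.

```rust
// Stable softplus: relu(x) + log1p(exp(-|x|)).
fn softplus_stable(x: f64) -> f64 {
    x.max(0.0) + (-x.abs()).exp().ln_1p()
}

// Naive form for comparison: overflows to +inf for large x.
fn softplus_naive(x: f64) -> f64 {
    (1.0 + x.exp()).ln()
}
```

At x = 1000 the naive form returns infinity while the stable form returns 1000 exactly (the correction term underflows to zero).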
Expand the flat communicator.rs + nexar_communicator.rs into a proper module directory, separating concerns across dedicated files:
- traits.rs: Communicator trait and ReduceOp enum
- noop.rs: NoOpCommunicator for single-device operation
- nexar.rs: NexarNetCommunicator for inter-node QUIC transport
- nexar_compat.rs: dtype/op mapping helpers for nexar integration
- group.rs: CommunicatorGroup and ParallelDim for tensor/pipeline parallelism
- hierarchical.rs: HierarchicalCommunicator combining intra-node NCCL with inter-node nexar for optimal bandwidth utilization

Replace the coarse-grained `nexar` feature flag with two finer-grained flags: `distributed` (nexar QUIC transport + tokio runtime) and `distributed-gpu` (distributed + NCCL for intra-node GPU collectives). Add nexar-nccl and tokio as optional dependencies accordingly.
Add `rand_seeded(shape, dtype, seed)` to `RandomOps` for reproducible random number generation. Calling with the same seed and shape always produces the same tensor, enabling deterministic initialization and testing.

- Trait: default impl returns `NotImplemented` for graceful degradation
- CPU: uses xoshiro256 uniform kernel, all float dtypes supported
- CUDA: launches existing rand kernel with explicit seed, FP8 via F32 cast
- WebGPU: seed truncated to u32 (WGSL has no native u64); determinism preserved
- Tests: reproducibility verified on all three backends; range check [0, 1)
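The `rand_seeded` contract — same seed in, same sequence out, mapped to [0, 1) — can be illustrated with a minimal counter-based PRNG (splitmix64 here, not numr's actual xoshiro256 kernel):

```rust
// splitmix64: a tiny deterministic PRNG; the state update and output
// mixing are pure functions of the seed, so sequences are reproducible.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E3779B97F4A7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
    z ^ (z >> 31)
}

// Map the top 53 bits to a double in [0, 1).
fn uniform01(state: &mut u64) -> f64 {
    (splitmix64(state) >> 11) as f64 / (1u64 << 53) as f64
}
```

Two generators seeded identically yield identical streams, which is exactly the property the backend tests verify.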
Extract AVX-512 and AVX2+FMA dot product paths into dedicated `#[target_feature]`-annotated functions so the compiler can optimize each function body fully for its ISA without runtime branching overhead. Both paths now use two independent FMA accumulators interleaved, hiding the 4-5 cycle FMA latency on modern x86 and doubling effective throughput for the GEMV-BT inner loop.
Add NEON implementations for gemv_bt_f32 and gemv_bt_f64 on aarch64, processing 4 output columns at a time with vfmaq_f32/vfmaq_f64 FMA instructions. The f32 path unrolls the inner loop 4-wide for better throughput; the f64 path uses dual accumulators to avoid RAW stalls.

Extract batch_bf16_to_f32 and batch_f16_to_f32 SIMD inner loops into dedicated functions annotated with #[target_feature(enable = "avx2")] and #[target_feature(enable = "f16c", enable = "avx")] respectively, with explicit scalar fallbacks. This ensures Rust emits the correct target feature guards and prevents UB from calling AVX instructions on CPUs that do not support them.

Simplify the AVX-512 i8xi8 dot-product dispatch: SimdLevel::Avx512 is only set when avx512bw is confirmed available, so the redundant is_x86_feature_detected! guard inside the match arm is removed.
Replace single-accumulator loops in the variance phase of fused layer norm and fused RMS norm (AVX2 and AVX512, forward and backward passes, f32 and f64) with a dual-accumulator pattern that processes two SIMD vectors per iteration. Combining the two partial sums with a single vector add at the end allows out-of-order CPUs to issue two independent FMA chains in parallel, eliminating the accumulator RAW dependency that previously serialized throughput to one vector per cycle.
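A scalar analogue of the dual-accumulator pattern (illustrative only; the real kernels operate on SIMD vectors, not scalars):

```rust
// Two independent partial sums break the single add-chain dependency;
// an out-of-order CPU can advance both chains in parallel. The partial
// sums are combined with one add at the end, plus the odd-length tail.
fn sum_dual(xs: &[f32]) -> f32 {
    let (mut a, mut b) = (0.0f32, 0.0f32);
    let mut chunks = xs.chunks_exact(2);
    for c in &mut chunks {
        a += c[0];
        b += c[1];
    }
    let tail: f32 = chunks.remainder().iter().sum();
    a + b + tail
}
```

Note that splitting the accumulation reorders floating-point adds, so results can differ from a single-accumulator loop in the last bits; the commit accepts this, as all SIMD reductions already reorder.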
Replace the manual transmute(0u32) no-flags workaround with the proper CUDA_GRAPH_INSTANTIATE_FLAG_AUTO_FREE_ON_LAUNCH constant. Graph-managed memory allocated during capture is freed on each launch, requiring callers to copy output tensors before the next launch. Update the comment to accurately describe the memory lifecycle instead of the previous (incorrect) rationale that justified suppressing the flag to preserve stable device pointers across replays.
Split the monolithic conv.rs into conv1d.rs, conv2d.rs, and conv_common.rs to follow the one-operation-per-file rule. Adds var_conv2d with full backward support (d_input via transposed convolution, d_weight via cross-correlation, d_bias via sum over batch and spatial dims).
Introduce src/runtime/cpu/kernels/rng.rs as numr's own PRNG and distribution sampler, removing the rand and rand_distr crate dependencies from Cargo.toml. All distribution kernels (distributions.rs, memory.rs, quasirandom.rs) now call into this internal module instead of directly using rand APIs.
Remove RandomOps from the TensorOps supertrait bound so random operations are opt-in rather than required by the core tensor interface. Group random op traits (RandomOps, AdvancedRandomOps, QuasiRandomOps, MultivariateRandomOps) into a dedicated re-export block in ops/mod.rs and lib.rs prelude. Fix var_dropout to be exported as a standalone item in autograd/mod.rs, and update the import in tensor_decompose_core.rs accordingly.
When one operand has a batch dimension of 1, its offset must stay fixed while the other operand advances through its batches. Previously both offsets were incremented unconditionally, so broadcasting a single matrix against a batch produced wrong results. The fix adds per-operand batch counts (a_batch / b_batch) derived from each input's own shape. CPU paths use conditional offset selection; CUDA kernels receive the two counts as extra parameters and compute offsets via modulo, which handles both symmetric and asymmetric broadcast cases uniformly. Affected paths: CPU matmul, CPU semiring_matmul, CUDA matmul_batched, CUDA matmul_bias_batched, CUDA semiring_matmul_batched, and all CUDA GEMV variants (gemv, gemv_bt, gemv_bt_mr).
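The modulo-based offset computation can be sketched as follows (a minimal model with assumed parameter names; the real kernels work in raw element strides per batch):

```rust
// For batch index `batch`, each operand's offset wraps by its own batch
// count. A broadcast operand (batch count 1) therefore always maps to
// offset 0 while the other operand advances normally.
fn batch_offsets(
    batch: usize,
    a_batch: usize,
    b_batch: usize,
    a_stride: usize, // elements per batch of A
    b_stride: usize, // elements per batch of B
) -> (usize, usize) {
    ((batch % a_batch) * a_stride, (batch % b_batch) * b_stride)
}
```

With `a_batch = 1` and `b_batch = 4`, batch 2 reads A at offset 0 (the single broadcast matrix) and B at its third slice, whereas the old unconditional increment would have read A out of bounds.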
All public and pub(crate) unsafe fn declarations in the CUDA sparse kernel modules were missing # Safety documentation required by clippy's missing_safety_doc lint. Add precise safety contracts covering device memory validity, element count requirements, index range constraints, and stream lifetime rules for each launcher.
Move DType from the module-level use into the cfg-conditional blocks where it is actually referenced, eliminating unused-import warnings on non-SIMD targets.
Replaces the monolithic reduce.rs (1025 lines) with a focused directory:
- common.rs: shared helpers (ensure_contiguous, broadcast utilities)
- sum_mean.rs: SumBackward, MeanBackward
- extremum.rs: MaxBackward, MinBackward
- statistical.rs: remaining statistical reduction gradients
Replaces two large monolithic launcher files with per-operation directories:
- index/: gather, scatter, index_select, masked, slice_assign, embedding
- sparse_merge/: csr, csc, generic, helpers

Each module stays within the 500-line file size limit.
Extends CUDA kernels to handle FP8E4M3 and FP8E5M2 dtypes with F32 accumulation throughout:
- fused_add_norm: FP8 fused add+RMSNorm/LayerNorm forward and backward with atomicCAS-based FP8 atomic accumulation for weight gradients
- fused_elementwise: FP8 fused_mul_add, fused_add_mul, fused_mul_add_scalar
- distance: FP8 cdist/pdist via AccType<fp8> → float specializations
- semiring_matmul: F16, BF16, FP8 semiring kernels (compute in F32)
- ternary: FP8 instantiations for ternary select kernels
- utility: native F16/BF16/FP8 fill values and FP8 arange/linspace support
- cpu/activation: simplify GELU to use the tanh op directly, avoiding a manual exp-based tanh that overflows in low-precision dtypes
- cpu/distance: cdist/pdist promote FP8 inputs to F32 for computation
- cuda/gemm_epilogue: FP8 matmul_bias and matmul_bias_residual promote to F32 (the tiled GEMM shared-memory path requires native arithmetic)
- cuda/normalization: fused_add_layer_norm_bwd promotes FP8 to F32 to avoid precision loss in the multi-pass backward with atomic accumulation
- cuda/semiring_matmul: allow F16, BF16, FP8 through dtype validation
- ops/semiring: fix dtype check logic to correctly return true for F16/BF16/FP8 under their respective feature flags
- Loosen FP8E4M3 tolerance to rtol=0.3/atol=2.5 to accommodate rounding error accumulation in compound ops (norm backward, GEMM)
- Prefix unused result bindings with _ in conditional and distance tests
Implement AVX2-vectorized kernels for exp/log, trigonometric functions, hyperbolic functions, reductions, and special functions (erf, gamma, Bessel). Each kernel follows the #[target_feature(enable = "avx2")] pattern with dual accumulators where applicable to hide FMA pipeline latency.
Remove overly specific patch version pins from nexar, nexar-nccl, and paste dependencies, using minor-version constraints instead to allow compatible patch updates.
The import is only used in FP8 code paths, so it should not be unconditionally present. This resolves the unused import warning on non-fp8 builds.
The cpu feature is enabled by default, so passing --features cpu alongside --no-default-features was contradictory. The checks now correctly validate compilation with no features active.
- Replace vmvnq_u64 with veorq_u64(..., !0) in the NEON softmax kernel, since vmvnq_u64 is not available in stable aarch64 intrinsics
- Remove exhaustive catch-all arms from match expressions in the unary and special kernels that were unreachable after full variant coverage was added
- Prefix unused intermediate NEON reduction variables with underscore to suppress dead-code warnings in cumulative and index kernels
- Gate x86_64 microkernel macros and SimdLevel imports behind #[cfg(target_arch = "x86_64")] to avoid unused-import warnings on non-x86 targets
- Add #[allow(unreachable_code)] to the scalar SIMD fallback path
- Fix a Vec type annotation in the reduce test to satisfy clippy
Replace raw f64 casts in the GEMM epilogue backward kernel with a generic AccFloat trait dispatched at runtime. F64 tensors accumulate in f64; all sub-f32 types (F16, BF16) and F32 accumulate in f32, matching standard ML framework practice and avoiding unnecessary precision loss on the hot path.
The assertion was referencing `cpu_result` instead of `_cpu_result`, causing a compilation warning and referencing the wrong binding in the WebGPU vs CPU comparison for the where_cond test.
Bump actions/checkout and actions/cache from v4 to v5 across all workflow files (baseline, benchmark, release, test).
Add coverage for features shipped in 0.5.0:
- Fused GEMM epilogue (matmul+bias+activation, forward+backward)
- Fused activation-mul for gated architectures
- Fused add-norm (residual + normalize in one pass)
- Fused element-wise operation chains across all backends
- i8×i8→i32 and FP8 quantized matmul paths
- 2:4 structured sparsity with multi-backend support
- slice_assign indexing operation
- Seeded deterministic RNG
- Expanded autograd differentiable op coverage
- CUDA caching allocator and GEMV fast paths
Platform-specific floating-point edge cases in SiLU and Tanh derivative computation could produce NaN or Inf on Windows CI, propagating non-finite gradients through the backward pass. Guard against this by replacing any non-finite derivative value with zero before accumulating into the gradient.
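The guard itself reduces to a one-line filter applied before accumulation (a sketch with an assumed name; the real kernel applies this per element inline):

```rust
// Replace any non-finite derivative (NaN or ±Inf) with zero so it
// cannot poison the accumulated gradient.
fn sanitize_grad(g: f32) -> f32 {
    if g.is_finite() { g } else { 0.0 }
}
```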
Summary
numr 0.5.0 — 131 commits, 875 files changed, +85k/-28k lines.
Fused Operations
FP8 & Quantized Compute
Sparse
Autograd Expansion
Performance
Runtime & Infrastructure
Architecture
Fixes
Replace vmvnq_u64 with correct bitwise NOT

Test plan

- `cargo test` passes (all platforms)
- `cargo test --features f16,sparse` passes
- `cargo test --features wgpu` passes
- `cargo publish --dry-run` succeeds