
v0.5.0: fused ops, FP8 compute, 2:4 sparsity, autograd expansion#6

Merged
farhan-syah merged 132 commits into main from 0.5.0
Mar 14, 2026

Conversation

@farhan-syah
Collaborator

Summary

numr 0.5.0 — 131 commits, 875 files changed, +85k/-28k lines.

Fused Operations

  • Fused GEMM epilogue: matmul+bias+activation in a single kernel (forward + backward)
  • Fused activation-mul for gated architectures (SwiGLU, SiLU-mul)
  • Fused add-norm: residual add + normalization in one pass (forward + backward)
  • Fused elementwise operation chains across all backends

FP8 & Quantized Compute

  • FP8 (E4M3/E5M2) matmul across all backends
  • FP8 kernel support across CUDA compute paths
  • i8×i8→i32 quantized matrix multiplication (CPU)

Sparse

  • 2:4 structured sparsity with multi-backend support

Autograd Expansion

  • Differentiable conv1d, conv2d, softmax, rms_norm, layer_norm, SiLU, softplus, SwiGLU, dropout, fused GEMM epilogue, fused add-norm, dtype cast, narrow, cat, gather

Performance

  • CUDA caching allocator (replaces stream-ordered alloc)
  • CUDA pipelined D2H copy for concurrent execution
  • GEMV-BT fast paths across CPU/CUDA/WebGPU (inference-critical)
  • Online softmax in SIMD kernels
  • Welford algorithm for numerically stable variance
  • AVX2 transcendental/special function SIMD kernels
  • Tiled GEMM with dual-accumulator FMA microkernels (AVX2/AVX-512/NEON)
  • Half-precision GEMV-BT acceleration (f16/bf16)
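The Welford update mentioned above can be sketched in a few lines. This is an illustrative scalar version under assumed conventions, not numr's SIMD kernel:

```rust
// Welford's single-pass variance: track the running mean and the sum of
// squared deviations (m2) so large offsets don't cancel catastrophically,
// unlike the naive E[x^2] - E[x]^2 formula.
fn welford_variance(xs: &[f64]) -> f64 {
    let (mut mean, mut m2) = (0.0, 0.0);
    for (n, &x) in xs.iter().enumerate() {
        let delta = x - mean;
        mean += delta / (n as f64 + 1.0);
        m2 += delta * (x - mean); // uses mean both before and after the update
    }
    m2 / xs.len() as f64 // population variance
}

fn main() {
    // A huge offset would wreck the sum-of-squares formula in f64;
    // Welford still recovers the variance of {4, 7, 13, 16}, which is 22.5.
    let xs: Vec<f64> = [4.0, 7.0, 13.0, 16.0].iter().map(|&x| x + 1e9).collect();
    assert!((welford_variance(&xs) - 22.5).abs() < 1e-6);
    println!("ok");
}
```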

Runtime & Infrastructure

  • CUDA graph capture support
  • NCCL communicator for multi-GPU collectives
  • Nexar inter-node communicator
  • Seeded deterministic RNG across all backends
  • Internal RNG (removed external rand/rand_distr dependency)
  • Slice assign operation across all backends
  • Streaming sync ops for compute-communication overlap

Architecture

  • Runtime::DType associated type (tensors generic over runtime dtype)
  • CPU backend made unconditional
  • Backward pass accumulation in precision-appropriate float type
  • Static WGSL shaders replacing runtime generation
  • Extensive file splits and refactoring

Fixes

  • aarch64 NEON: replaced non-existent vmvnq_u64 with correct bitwise NOT
  • Softmax NaN prevention for -inf inputs
  • Contiguity check for size-1 dim strides
  • CUDA graph capture allocator freeze/unfreeze
  • Batched matmul broadcasting across all backends

Test plan

  • cargo test passes (all platforms)
  • cargo test --features f16,sparse passes
  • cargo test --features wgpu passes
  • Backend parity tests (CPU vs CUDA vs WebGPU)
  • cargo publish --dry-run succeeds
  • CI: Ubuntu, macOS (aarch64), Windows

Add Hash trait to Layout, Shape, and Strides to enable their use in
hash-based collections. Fix contiguity check to correctly identify
strided views that maintain row-major order regardless of offset.

Extend Layout API with methods for common tensor operations:
- Transpose operations (t, transpose_axes)
- Dimension manipulation (squeeze_dim, squeeze_all, unsqueeze_at)
- Flattening and permutation (flatten, permute_dims)
- Advanced indexing (as_strided, index_to_offset, offset_to_index)
- Broadcasting utilities (broadcast_shape, broadcast_shapes)
- Storage calculations (storage_size)
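As an illustration of the broadcasting utilities listed above, a NumPy-style shape-broadcast rule can be sketched as follows. The signature is hypothetical, not numr's actual `broadcast_shape` API:

```rust
// Align trailing dimensions; a missing or size-1 dimension broadcasts
// against the other operand's size, anything else is incompatible.
fn broadcast_shape(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let ndim = a.len().max(b.len());
    let mut out = vec![0; ndim];
    for i in 0..ndim {
        let da = if i < ndim - a.len() { 1 } else { a[i - (ndim - a.len())] };
        let db = if i < ndim - b.len() { 1 } else { b[i - (ndim - b.len())] };
        out[i] = match (da, db) {
            (x, y) if x == y => x,
            (1, y) => y,
            (x, 1) => x,
            _ => return None, // incompatible shapes
        };
    }
    Some(out)
}

fn main() {
    assert_eq!(broadcast_shape(&[8, 1, 3], &[4, 3]), Some(vec![8, 4, 3]));
    assert_eq!(broadcast_shape(&[2, 3], &[4, 3]), None);
    println!("ok");
}
```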

Add From trait implementations for ergonomic layout construction from
tuples, arrays, and slices up to 6 dimensions.
Increment minor version to reflect new tensor layout API features
including Hash trait implementations and comprehensive dimension
manipulation methods.
…rait

Split monolithic dtype/mod.rs into focused modules for better maintainability
and extensibility. Introduces DataType trait to enable downstream libraries
like boostr to define custom dtype enums with quantized variants while
maintaining compatibility with numr's core tensor operations.
Enables runtimes to specify their dtype enum through an associated type,
allowing downstream libraries to extend numr with custom quantized types
while maintaining type safety and backend compatibility.
Updates Tensor to use Runtime's associated DType instead of hardcoded numr::DType,
enabling extensibility for downstream libraries. Reorganizes tensor factory methods
to separate generic DataType operations from concrete DType-specific constructors,
improving code organization and reducing duplication.
Propagates Runtime<DType = DType> bounds throughout operation traits,
implementation helpers, and shape utilities to support the new
extensible dtype system while maintaining backward compatibility.
Propagates dtype trait bounds through linear algebra and polynomial algorithms,
maintaining consistency with the new extensible type system for tensor decomposition,
polynomial operations, and FFT-based convolutions.
Propagates dtype trait bounds through gradient computation and variable
operations, ensuring type safety in automatic differentiation with the
extensible dtype system.
Updates test utilities and backend parity checks to work with the new
DataType trait, ensuring comprehensive validation across CPU, CUDA,
and WebGPU backends with the extensible dtype architecture.
Introduce AllocationStats for profiling allocator behavior and
TrackingAllocator<A> — a generic wrapper that layers thread-safe
tracking on top of any Allocator implementation.

TrackingAllocator records total allocations, total bytes, active
allocation count, peak memory usage (high-water mark), and frozen
state. Cloning shares the same Arc<Mutex<...>> state so that all
handles observe the same counters.

Two new error variants support the allocator lifecycle:
- AllocatorBusy: reset rejected while live allocations exist
- AllocatorFrozen: new allocations rejected while frozen

The Allocator trait gains two defaulted methods:
- stats() -> AllocationStats (zeroed default for non-tracking impls)
- reset() -> Result<()> (no-op default)

Tests cover: basic stat tracking, allocated_bytes(), freeze/unfreeze,
reset success, reset-while-busy rejection, peak across cycles, clone
state sharing, and freeze preservation through reset.
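The shared-state tracking described above can be sketched with a minimal counter wrapper. Names and fields here are illustrative, not TrackingAllocator's real API:

```rust
use std::sync::{Arc, Mutex};

// Thread-safe allocation stats with a high-water mark; clones share
// the same Arc<Mutex<..>> so every handle observes the same counters.
#[derive(Default, Clone, Copy, Debug, PartialEq)]
struct AllocationStats {
    active_bytes: usize,
    peak_bytes: usize,
    total_allocs: usize,
}

#[derive(Clone, Default)]
struct Tracker(Arc<Mutex<AllocationStats>>);

impl Tracker {
    fn on_alloc(&self, bytes: usize) {
        let mut s = self.0.lock().unwrap();
        s.active_bytes += bytes;
        s.total_allocs += 1;
        s.peak_bytes = s.peak_bytes.max(s.active_bytes); // high-water mark
    }
    fn on_free(&self, bytes: usize) {
        self.0.lock().unwrap().active_bytes -= bytes;
    }
    fn stats(&self) -> AllocationStats {
        *self.0.lock().unwrap()
    }
}

fn main() {
    let t = Tracker::default();
    let handle = t.clone(); // shares state with t
    t.on_alloc(100);
    t.on_alloc(50);
    t.on_free(100);
    let s = handle.stats();
    assert_eq!((s.active_bytes, s.peak_bytes, s.total_allocs), (50, 150, 2));
    println!("ok");
}
```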
Add a set of commonly needed methods to Tensor<R> that reduce
boilerplate in downstream code.

Ergonomic aliases for existing accessors:
- rank()        -> ndim() alias
- elem_count()  -> numel() alias
- dims()        -> shape() alias returning &[usize]
- len()         -> numel() alias for Iterator/slice parity
- is_empty()    -> true when numel() == 0

Typed dimension access:
- dim(index: isize) -> Result<usize>, negative-index aware
- dims1() through dims5() unpack shape into typed tuples,
  returning ShapeMismatch when the rank does not match

Low-level storage inspection:
- offset()             -> layout offset in elements
- ptr()                -> raw base storage pointer
- data_ptr()           -> ptr + offset * dtype_size (first element)
- owns_memory()        -> whether storage deallocates on drop
- shares_storage_with() -> true when two tensors share a buffer
- ref_count()          -> storage Arc reference count

Construction helper:
- from_storage_contiguous(storage, shape) builds a Tensor directly
  from a Storage handle without going through a client

Deep copy:
- to_bytes()   -> materializes tensor data as raw bytes (contiguous first)
- clone_deep() -> full copy with independent storage
Introduce a Graph trait for capturing and replaying computation sequences,
backed by CUDA Graphs on the CUDA runtime and a no-op eager path on CPU
and WebGPU.

- Add Graph trait with launch() and is_replay_capable() to src/runtime/graph.rs
- Add NoOpGraph for CPU and WebGPU (operations execute eagerly during capture)
- Add CudaGraph wrapping cudarc's CudaGraph behind Arc<Mutex<>> for Send + Sync
- Add Runtime::Graph as a new associated type on the Runtime trait
- Add Runtime::capture_graph() as a required method replacing the stub
- Implement capture_graph() on CpuRuntime (eager), WgpuRuntime (eager), and
  CudaRuntime (real stream capture via cudarc begin_capture/end_capture)
- CUDA implementation correctly ends capture even when the closure fails so
  the stream is never left in capture mode
- Add unit tests for CPU eager execution, error propagation, and NoOpGraph
- Update MockRuntime in external_backend_api.rs to satisfy the new trait bound
Replace bare R: Runtime bounds with R: Runtime<DType = DType> in all
sites that work directly with DType values. This eliminates implicit
assumptions about the associated type and makes each function's
requirements explicit to the type checker.

Affected sites:
- fallback.rs: validate_binary_dtypes, compute_broadcast_shape, all
  fallback op helpers (binary, unary, scalar, reduce, activation,
  softmax, matmul, compare, where_cond, csc/coo elementwise)
- statistics_common.rs: skew_composite, kurtosis_composite
- impl_generic/linalg.rs: triangular_mask_impl, triu_impl, tril_impl,
  slogdet_impl
- impl_generic/utility.rs: one_hot_impl

Also remove an unconditional TypeConversionOps import in cuda/random.rs
that is only needed under the fp8 feature flag, and drop an unused
TypeConversionOps import in cuda/linalg/statistics.rs.
Adds integration tests verifying F16, BF16, FP8E4M3, and FP8E5M2
support across all ML-critical CPU operations: binary, scalar, unary,
reduce, matmul, activations, and normalizations.

Each dtype is audited end-to-end including round-trip casts from F32,
with per-operation pass/fail reporting and a summary assertion to catch
regressions in reduced-precision coverage.
…munication

Introduces a runtime-level abstraction for collective and point-to-point
communication across devices, supporting distributed FFT, parallel linear
algebra, Monte Carlo simulations, and gradient synchronization.

- `Communicator` trait with allreduce, broadcast, allgather, reducescatter,
  and point-to-point send/recv operations over raw device pointers
- `ReduceOp` enum covering Sum, Prod, Min, Max reductions
- `NoOpCommunicator` for single-device operation (world_size=1):
  in-place collectives are true no-ops, separate-buffer collectives
  perform a memcpy, point-to-point ops are no-ops
- Re-export `Communicator`, `NoOpCommunicator`, and `ReduceOp` from
  `runtime` public API
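The single-device no-op behavior described above can be sketched with a simplified trait. The shape is illustrative; numr's real `Communicator` operates over raw device pointers:

```rust
// With world_size = 1, an in-place allreduce of a single rank's
// contribution is the contribution itself, so the op is a true no-op.
trait Communicator {
    fn world_size(&self) -> usize;
    fn allreduce_sum_inplace(&self, buf: &mut [f32]);
}

struct NoOpCommunicator;

impl Communicator for NoOpCommunicator {
    fn world_size(&self) -> usize {
        1
    }
    fn allreduce_sum_inplace(&self, _buf: &mut [f32]) {
        // nothing to reduce against
    }
}

fn main() {
    let comm = NoOpCommunicator;
    let mut grads = vec![1.0f32, 2.0, 3.0];
    comm.allreduce_sum_inplace(&mut grads);
    assert_eq!(grads, vec![1.0, 2.0, 3.0]); // unchanged
    println!("ok");
}
```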
Replace direct `.unwrap()` on Mutex::lock() calls with a private `lock()`
helper that recovers from a poisoned lock via `into_inner()`. A poisoned
lock means another thread panicked while holding it; the tracking counters
may be inconsistent but the inner allocator remains usable, making recovery
safer than propagating a panic to the caller.
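A poison-recovering lock helper of this kind can be sketched as follows (illustrative, assuming a free-standing helper rather than numr's private method):

```rust
use std::sync::{Arc, Mutex, MutexGuard};

// If another thread panicked while holding the lock, the Mutex is
// poisoned; take the inner guard anyway instead of propagating the panic.
fn lock<T>(m: &Mutex<T>) -> MutexGuard<'_, T> {
    m.lock().unwrap_or_else(|poisoned| poisoned.into_inner())
}

fn main() {
    let counter = Arc::new(Mutex::new(0u32));
    let c = Arc::clone(&counter);
    // Poison the mutex by panicking while the guard is held.
    let _ = std::thread::spawn(move || {
        let _g = c.lock().unwrap();
        panic!("poison the lock");
    })
    .join();
    // A plain .lock().unwrap() would panic here; the helper recovers.
    *lock(&counter) += 1;
    assert_eq!(*lock(&counter), 1);
    println!("ok");
}
```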
All top-level items in ml_dtype_audit.rs are now guarded with
#[cfg(any(feature = "f16", feature = "fp8"))] so the test file compiles
cleanly without those optional features enabled.
Add `as_host_slice` and `as_host_slice_mut` unsafe methods to `Storage<R>`
that return borrowed slices into CPU-backed memory without allocating. Both
methods short-circuit on empty storage and document the safety invariants
required of callers (valid host pointer, no aliasing for the mutable variant).
Implement NarrowBackward and CatBackward gradient functions, enabling
autograd to propagate gradients through tensor slicing and concatenation.

NarrowBackward pads the incoming gradient with zeros to restore the
original shape along the narrowed dimension. CatBackward splits the
output gradient back into per-input slices using narrow, reversing the
concatenation exactly.

Export var_narrow and var_cat from the autograd crate root alongside the
existing shape op exports.
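The two gradient rules can be illustrated in one dimension (a sketch only; the real ops work on tensors along an arbitrary dimension):

```rust
// narrow's backward zero-pads the incoming gradient back to the original
// length: positions outside the narrowed window received no gradient.
fn narrow_backward(grad: &[f32], orig_len: usize, start: usize) -> Vec<f32> {
    let mut out = vec![0.0; orig_len];
    out[start..start + grad.len()].copy_from_slice(grad);
    out
}

// cat's backward splits the output gradient at the input boundaries,
// i.e. a sequence of narrows that exactly reverses the concatenation.
fn cat_backward(grad: &[f32], input_lens: &[usize]) -> Vec<Vec<f32>> {
    let mut pieces = Vec::new();
    let mut offset = 0;
    for &len in input_lens {
        pieces.push(grad[offset..offset + len].to_vec());
        offset += len;
    }
    pieces
}

fn main() {
    assert_eq!(narrow_backward(&[1.0, 2.0], 5, 1), vec![0.0, 1.0, 2.0, 0.0, 0.0]);
    assert_eq!(
        cat_backward(&[1.0, 2.0, 3.0], &[1, 2]),
        vec![vec![1.0], vec![2.0, 3.0]]
    );
    println!("ok");
}
```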
In CudaGraph::launch, recover from a poisoned mutex rather than
panicking, consistent with the existing TrackingAllocator fix.

In Storage::as_host_slice_mut, change the receiver from &self to &mut
self so the mutable slice borrow is sound — a mutable slice must come
from exclusive access to the backing storage.
Add slice_assign to IndexingOps, which copies a source tensor into a
contiguous slice of a destination tensor along a given dimension starting
at a specified index, returning a new tensor with the region replaced.

Implemented natively on all three backends:

- CPU: pointer-based kernel that copies dst then overwrites the slice
  region with src using dispatch_dtype
- CUDA: PTX kernel instantiated for all supported dtypes (f32, f64,
  f16, bf16, i32, i64, fp8_e4m3, fp8_e5m2) via the existing
  launch_slice_assign launcher
- WebGPU: WGSL compute shader generated per dtype (f32, i32, u32) with
  a SliceAssignParams uniform; get_buffer is widened to pub to support
  the bind group wiring

Expose the operation on Tensor<R> via Tensor::slice_assign for
ergonomic use at the call site.
…tives

Implement NcclCommunicator wrapping cudarc's nccl::Comm to satisfy the
Communicator trait for CUDA multi-GPU workloads. Supports all_reduce,
broadcast, all_gather, reduce_scatter, send, recv, sync, and barrier.

DType dispatch is handled via raw nccl::result FFI to avoid compile-time
NcclType generic constraints, covering F32, F64, F16, BF16, FP8E4M3,
FP8E5M2, I32, I64, I8, U32, and U8. A new nccl feature flag chains the
cuda feature and cudarc's nccl feature behind a single opt-in gate.
NcclCommunicator is re-exported from the runtime crate root when the
flag is active.
Implement var_rms_norm and var_layer_norm with full gradient support
for the autograd system. Both operations use the fused NormalizationOps
kernel for the forward pass and compute numerically stable gradients
in the backward pass.

RMS norm gradients account for the interaction between input and weight
via the rstd and x_norm tensors recomputed from saved inputs. Layer norm
gradients additionally handle the bias term and subtract the mean of
the scaled gradient to satisfy the zero-sum constraint over the
normalized dimension.

Both var_backward and backward_var paths are implemented, enabling
higher-order gradient computation through normalization layers.
Introduce NexarNetCommunicator, which implements the Communicator trait
using nexar::SyncClient as the transport layer. This enables inter-node
collective operations (allreduce, broadcast, all_gather, reduce_scatter,
send, recv, barrier) over QUIC without requiring NCCL or any GPU-specific
infrastructure.

The implementation is gated behind the nexar feature flag and is intended
for CPU-to-CPU inter-node gradient synchronization and tensor parallelism.
For intra-node GPU-GPU traffic, NcclCommunicator remains the right choice
given NVLink and PCIe bandwidth advantages.

DType and ReduceOp mappings cover F32, F64, F16, BF16, integer types,
and reject unsupported types with a clear error.
…inter

Previously the Tensor API had two pointer accessors: ptr() which returned
the raw base storage address, and data_ptr() which returned the
offset-adjusted pointer to the first element of the tensor view.

This caused widespread confusion where call sites used storage().ptr()
instead of data_ptr() and therefore silently operated on the wrong memory
address for non-zero-offset views (slices, transposes).

Remove data_ptr() and redefine ptr() to always return the offset-adjusted
pointer. Update all call sites across ops, runtime helpers, kernels, and
sparse operations to use the unified ptr() accessor.
Add two composite activation operations following the impl_generic pattern:

log_softmax: computed as x - logsumexp(x, dim) for numerical stability.
Implemented in impl_generic/activation.rs and delegated by all three
backends (CPU, CUDA, wgpu). Includes LogSoftmaxBackward grad function
in the autograd system and var_log_softmax for traced computation.

dropout: randomly zeros elements with probability p during training and
scales remaining elements by 1/(1-p). Returns input unchanged during
inference. Implemented in impl_generic and delegated by all backends.

Both operations are exposed via Tensor convenience methods (log_softmax,
dropout) and tested with unit tests covering standard cases, edge cases
(p=0, p=1), and gradient correctness.
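The x - logsumexp(x) formulation can be sketched in scalar form, with the usual max-subtraction inside logsumexp for stability (illustrative, not numr's kernel):

```rust
// log_softmax(x) = x - logsumexp(x). Subtracting the max before
// exponentiating keeps exp() in range even for very large inputs.
fn log_softmax(x: &[f64]) -> Vec<f64> {
    let m = x.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let lse = m + x.iter().map(|v| (v - m).exp()).sum::<f64>().ln();
    x.iter().map(|v| v - lse).collect()
}

fn main() {
    // Inputs of 1000 would overflow a naive exp-then-log path.
    let y = log_softmax(&[1000.0, 1000.0]);
    assert!((y[0] - (-std::f64::consts::LN_2)).abs() < 1e-12);

    // exp(log_softmax(x)) is a probability distribution: it sums to 1.
    let z = log_softmax(&[0.1, 0.2, 0.3]);
    let total: f64 = z.iter().map(|v| v.exp()).sum();
    assert!((total - 1.0).abs() < 1e-12);
    println!("ok");
}
```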
…lvers

The iterative solver helpers (vector_norm, vector_dot, update_solution,
accumulate_basis_combination, extract_diagonal_inv) and their callers in
all GMRES variants, CG, BiCGSTAB, CGS, QMR, MINRES, Lanczos, Arnoldi,
Jacobi, SOR, SVDS, AMG, and sparse LU decompositions were using an
unconstrained R: Runtime bound.

These functions extract scalar values via item() which requires the runtime
to use the standard DType. Tighten the bound to R: Runtime<DType = DType>
to make this requirement explicit and prevent misuse with non-standard
runtime type parameters.
Implement var_silu and SiluBackward for differentiable SiLU (Swish)
support in the autograd system. The gradient uses the numerically
stable form: sigmoid(x) * (1 + x - silu(x)), avoiding a redundant
sigmoid computation by reusing the saved forward output.

Also promote ActivationOps from a test-only import to a full import
in the activation backward module, since SiluBackward requires it
unconditionally.
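The gradient identity quoted above can be checked numerically against a central finite difference (a scalar sketch, not the autograd code):

```rust
// d/dx silu(x) = sigmoid(x) * (1 + x - silu(x)):
// expanding silu(x) = x * sigmoid(x) and differentiating recovers this
// form, which reuses the forward output instead of a second sigmoid.
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}
fn silu(x: f64) -> f64 {
    x * sigmoid(x)
}
fn silu_grad(x: f64) -> f64 {
    sigmoid(x) * (1.0 + x - silu(x))
}

fn main() {
    for &x in &[-3.0, -0.5, 0.0, 0.5, 3.0] {
        let h = 1e-6;
        let fd = (silu(x + h) - silu(x - h)) / (2.0 * h); // central difference
        assert!((silu_grad(x) - fd).abs() < 1e-6);
    }
    println!("ok");
}
```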
Implement softplus — log(1 + exp(x)) — across the full stack:

- `ActivationOps::softplus` trait method with a default NotImplemented body
- `softplus_impl` in impl_generic using the numerically stable form
  `relu(x) + log(1 + exp(-|x|))` to avoid overflow for large positive inputs
- CPU, CUDA, and WebGPU backends delegate to softplus_impl
- `var_softplus` autograd op with `SoftplusBackward` gradient node;
  backward computes sigmoid(x), which is the exact derivative
- Tests covering zero, non-zero, large positive/negative, batched input,
  and non-unit upstream gradients
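The stable formulation can be demonstrated in scalar form (a sketch; the real implementation lives in impl_generic and the backend kernels):

```rust
// relu(x) + ln(1 + exp(-|x|)) equals ln(1 + exp(x)) algebraically but
// never exponentiates a large positive value, so it cannot overflow.
fn softplus_naive(x: f64) -> f64 {
    (1.0 + x.exp()).ln()
}
fn softplus_stable(x: f64) -> f64 {
    x.max(0.0) + (1.0 + (-x.abs()).exp()).ln()
}

fn main() {
    // Agrees with the naive form wherever the naive form is representable...
    for &x in &[-5.0, -0.1, 0.0, 0.1, 5.0] {
        assert!((softplus_naive(x) - softplus_stable(x)).abs() < 1e-12);
    }
    // ...and stays finite where exp(x) overflows to infinity.
    assert!(softplus_naive(1000.0).is_infinite());
    assert_eq!(softplus_stable(1000.0), 1000.0);
    println!("ok");
}
```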
…nd group support

Expand the flat communicator.rs + nexar_communicator.rs into a proper
module directory, separating concerns across dedicated files:

- traits.rs: Communicator trait and ReduceOp enum
- noop.rs: NoOpCommunicator for single-device operation
- nexar.rs: NexarNetCommunicator for inter-node QUIC transport
- nexar_compat.rs: dtype/op mapping helpers for nexar integration
- group.rs: CommunicatorGroup and ParallelDim for tensor/pipeline parallelism
- hierarchical.rs: HierarchicalCommunicator combining intra-node NCCL
  with inter-node nexar for optimal bandwidth utilization

Replace the coarse-grained `nexar` feature flag with two finer-grained
flags: `distributed` (nexar QUIC transport + tokio runtime) and
`distributed-gpu` (distributed + NCCL for intra-node GPU collectives).
Add nexar-nccl and tokio as optional dependencies accordingly.
Add `rand_seeded(shape, dtype, seed)` to `RandomOps` for reproducible
random number generation. Calling with the same seed and shape always
produces the same tensor, enabling deterministic initialization and
testing.

- Trait: default impl returns `NotImplemented` for graceful degradation
- CPU: uses xoshiro256 uniform kernel, all float dtypes supported
- CUDA: launches existing rand kernel with explicit seed, FP8 via F32 cast
- WebGPU: seed truncated to u32 (WGSL has no native u64); determinism preserved
- Tests: reproducibility verified on all three backends; range check [0, 1)
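The determinism contract can be illustrated with a small counter-seeded generator. This sketch uses splitmix64, not numr's xoshiro256 kernel:

```rust
// splitmix64: a tiny, well-known PRNG step function. Same seed in,
// same sequence out -- the property rand_seeded guarantees.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

// Map the top 53 bits to a uniform f64 in [0, 1).
fn uniform01(state: &mut u64) -> f64 {
    (splitmix64(state) >> 11) as f64 / (1u64 << 53) as f64
}

fn main() {
    let (mut s1, mut s2) = (42u64, 42u64);
    let a: Vec<f64> = (0..4).map(|_| uniform01(&mut s1)).collect();
    let b: Vec<f64> = (0..4).map(|_| uniform01(&mut s2)).collect();
    assert_eq!(a, b); // reproducible for equal seeds
    assert!(a.iter().all(|v| (0.0..1.0).contains(v))); // range check [0, 1)
    println!("ok");
}
```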
…ual accumulators

Extract AVX-512 and AVX2+FMA dot product paths into dedicated
`#[target_feature]`-annotated functions so the compiler can optimize
each function body fully for its ISA without runtime branching overhead.

Both paths now use two independent FMA accumulators interleaved, hiding
the 4-5 cycle FMA latency on modern x86 and doubling effective throughput
for the GEMV-BT inner loop.
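The dual-accumulator pattern can be shown in scalar form: two independent sums break the serial dependency between consecutive fused multiply-adds so the FMA units can overlap (illustrative only, not the SIMD kernels):

```rust
// Even and odd elements feed separate accumulators; each iteration's
// two mul_adds depend only on their own chain, not on each other.
fn dot_dual_acc(a: &[f32], b: &[f32]) -> f32 {
    let (mut acc0, mut acc1) = (0.0f32, 0.0f32);
    for (ca, cb) in a.chunks_exact(2).zip(b.chunks_exact(2)) {
        acc0 = ca[0].mul_add(cb[0], acc0); // dependency chain 0
        acc1 = ca[1].mul_add(cb[1], acc1); // chain 1, independent of chain 0
    }
    // Scalar tail for odd lengths, then combine the partial sums once.
    let tail: f32 = a
        .chunks_exact(2)
        .remainder()
        .iter()
        .zip(b.chunks_exact(2).remainder())
        .map(|(x, y)| x * y)
        .sum();
    acc0 + acc1 + tail
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0, 5.0];
    let b = [5.0, 4.0, 3.0, 2.0, 1.0];
    assert_eq!(dot_dual_acc(&a, &b), 35.0);
    println!("ok");
}
```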
…re annotations

Add NEON implementations for gemv_bt_f32 and gemv_bt_f64 on aarch64,
processing 4 output columns at a time with vfmaq_f32/vfmaq_f64 FMA
instructions. The f32 path unrolls the inner loop 4-wide for better
throughput; the f64 path uses dual accumulators to avoid RAW stalls.

Extract batch_bf16_to_f32 and batch_f16_to_f32 SIMD inner loops into
dedicated functions annotated with #[target_feature(enable = "avx2")] and
#[target_feature(enable = "f16c", enable = "avx")] respectively, with
explicit scalar fallbacks. This ensures Rust emits the correct target
feature guards and prevents UB from calling AVX instructions on CPUs that
do not support them.

Simplify the AVX-512 i8xi8 dot-product dispatch: SimdLevel::Avx512 is
only set when avx512bw is confirmed available, so the redundant
is_x86_feature_detected! guard inside the match arm is removed.
…duction

Replace single-accumulator loops in the variance phase of fused layer norm
and fused RMS norm (AVX2 and AVX512, forward and backward passes, f32 and
f64) with a dual-accumulator pattern that processes two SIMD vectors per
iteration. Combining the two partial sums with a single vector add at the
end allows out-of-order CPUs to issue two independent FMA chains in
parallel, eliminating the accumulator RAW dependency that previously
serialized throughput to one vector per cycle.
Replace the manual transmute(0u32) no-flags workaround with the proper
CUDA_GRAPH_INSTANTIATE_FLAG_AUTO_FREE_ON_LAUNCH constant. Graph-managed
memory allocated during capture is freed on each launch, requiring
callers to copy output tensors before the next launch.

Update the comment to accurately describe the memory lifecycle instead
of the previous (incorrect) rationale that justified suppressing the
flag to preserve stable device pointers across replays.
Split the monolithic conv.rs into conv1d.rs, conv2d.rs, and conv_common.rs
to follow the one-operation-per-file rule. Adds var_conv2d with full
backward support (d_input via transposed convolution, d_weight via
cross-correlation, d_bias via sum over batch and spatial dims).
Introduce src/runtime/cpu/kernels/rng.rs as numr's own PRNG and
distribution sampler, removing the rand and rand_distr crate
dependencies from Cargo.toml. All distribution kernels
(distributions.rs, memory.rs, quasirandom.rs) now call into this
internal module instead of directly using rand APIs.
Remove RandomOps from the TensorOps supertrait bound so random
operations are opt-in rather than required by the core tensor
interface. Group random op traits (RandomOps, AdvancedRandomOps,
QuasiRandomOps, MultivariateRandomOps) into a dedicated re-export
block in ops/mod.rs and lib.rs prelude. Fix var_dropout to be
exported as a standalone item in autograd/mod.rs, and update the
import in tensor_decompose_core.rs accordingly.
…ckends

When one operand has a batch dimension of 1, its offset must stay fixed
while the other operand advances through its batches. Previously both
offsets were incremented unconditionally, so broadcasting a single matrix
against a batch produced wrong results.

Fix adds per-operand batch counts (a_batch / b_batch) derived from each
input's own shape. CPU paths use conditional offset selection; CUDA kernels
receive the two counts as extra parameters and compute offsets via modulo,
which handles both symmetric and asymmetric broadcast cases uniformly.

Affected paths: CPU matmul, CPU semiring_matmul, CUDA matmul_batched,
CUDA matmul_bias_batched, CUDA semiring_matmul_batched, and all CUDA GEMV
variants (gemv, gemv_bt, gemv_bt_mr).
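The modulo-offset scheme can be sketched as follows: each operand advances through its own batch count, so a batch-1 operand stays pinned at offset 0 while the other side iterates (illustrative helper, not the kernel code):

```rust
// For output batch i, operand offsets are (i mod that operand's batch
// count) times its per-batch stride -- uniform for both the symmetric
// case and the broadcast case.
fn batch_offsets(
    out_batches: usize,
    a_batch: usize,
    b_batch: usize,
    a_stride: usize,
    b_stride: usize,
) -> Vec<(usize, usize)> {
    (0..out_batches)
        .map(|i| ((i % a_batch) * a_stride, (i % b_batch) * b_stride))
        .collect()
}

fn main() {
    // Broadcasting one matrix (a_batch = 1) against 3 batches of b:
    // a stays at offset 0, b advances by its stride each batch.
    assert_eq!(batch_offsets(3, 1, 3, 6, 8), vec![(0, 0), (0, 8), (0, 16)]);
    // Symmetric case: both sides advance in lockstep.
    assert_eq!(batch_offsets(2, 2, 2, 6, 8), vec![(0, 0), (6, 8)]);
    println!("ok");
}
```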
…tions

All public and pub(crate) unsafe fn declarations in the CUDA sparse kernel
modules were missing # Safety documentation required by clippy's
missing_safety_doc lint. Add precise safety contracts covering device memory
validity, element count requirements, index range constraints, and stream
lifetime rules for each launcher.
Move DType from the module-level use into the cfg-conditional blocks where it
is actually referenced, eliminating unused-import warnings on non-SIMD targets.
Replaces the monolithic reduce.rs (1025 lines) with a focused directory:
- common.rs: shared helpers (ensure_contiguous, broadcast utilities)
- sum_mean.rs: SumBackward, MeanBackward
- extremum.rs: MaxBackward, MinBackward
- statistical.rs: remaining statistical reduction gradients
…odules

Replaces two large monolithic launcher files with per-operation directories:
- index/: gather, scatter, index_select, masked, slice_assign, embedding
- sparse_merge/: csr, csc, generic, helpers

Each module stays within the 500-line file size limit.
Extends CUDA kernels to handle FP8E4M3 and FP8E5M2 dtypes with F32
accumulation throughout:

- fused_add_norm: FP8 fused add+RMSNorm/LayerNorm forward and backward
  with atomicCAS-based FP8 atomic accumulation for weight gradients
- fused_elementwise: FP8 fused_mul_add, fused_add_mul, fused_mul_add_scalar
- distance: FP8 cdist/pdist via AccType<fp8> → float specializations
- semiring_matmul: F16, BF16, FP8 semiring kernels (compute in F32)
- ternary: FP8 instantiations for ternary select kernels
- utility: native F16/BF16/FP8 fill values and FP8 arange/linspace support
- cpu/activation: simplify GELU to use tanh op directly, avoiding
  manual exp-based tanh that overflows in low-precision dtypes
- cpu/distance: cdist/pdist promote FP8 inputs to F32 for computation
- cuda/gemm_epilogue: FP8 matmul_bias and matmul_bias_residual promote
  to F32 (tiled GEMM shared-memory path requires native arithmetic)
- cuda/normalization: fused_add_layer_norm_bwd promotes FP8 to F32 to
  avoid precision loss in multi-pass backward with atomic accumulation
- cuda/semiring_matmul: allow F16, BF16, FP8 through dtype validation
- ops/semiring: fix dtype check logic to correctly return true for
  F16/BF16/FP8 under their respective feature flags
…nings

- Loosen FP8E4M3 tolerance to rtol=0.3/atol=2.5 to accommodate rounding
  error accumulation in compound ops (norm backward, GEMM)
- Prefix unused result bindings with _ in conditional and distance tests
…functions

Implement AVX2-vectorized kernels for exp/log, trigonometric functions,
hyperbolic functions, reductions, and special functions (erf, gamma, Bessel).
Each kernel follows the #[target_feature(enable = "avx2")] pattern with
dual accumulators where applicable to hide FMA pipeline latency.
Remove overly specific patch version pins from nexar, nexar-nccl, and
paste dependencies, using minor-version constraints instead to allow
compatible patch updates.
The import is only used in FP8 code paths, so it should not be
unconditionally present. This resolves the unused import warning on
non-fp8 builds.
The cpu feature is enabled by default, so passing --features cpu alongside
--no-default-features was contradictory. The checks now correctly validate
compilation with no features active.
…ess issues

Replace vmvnq_u64 with veorq_u64(..., !0) in the NEON softmax kernel since
vmvnq_u64 is not available in stable aarch64 intrinsics. Remove exhaustive
catch-all arms from match expressions in the unary and special kernels that
were unreachable after full variant coverage was added. Prefix unused
intermediate NEON reduction variables with underscore to suppress dead-code
warnings in cumulative and index kernels. Gate x86_64 microkernel macros and
SimdLevel imports behind #[cfg(target_arch = "x86_64")] to avoid unused-import
warnings on non-x86 targets. Add #[allow(unreachable_code)] to the scalar
SIMD fallback path. Fix Vec type annotation in reduce test to satisfy clippy.
… float type

Replace raw f64 casts in the GEMM epilogue backward kernel with a generic
AccFloat trait dispatched at runtime. F64 tensors accumulate in f64; all
sub-f32 types (F16, BF16) and F32 accumulate in f32, matching standard ML
framework practice and avoiding unnecessary precision loss on the hot path.
…ty assert

The assertion was referencing `cpu_result` instead of `_cpu_result`, causing
a compilation warning and referencing the wrong binding in the WebGPU vs CPU
comparison for the where_cond test.
Bump actions/checkout and actions/cache from v4 to v5 across all
workflow files (baseline, benchmark, release, test).
Add coverage for features shipped in 0.5.0:
- Fused GEMM epilogue (matmul+bias+activation, forward+backward)
- Fused activation-mul for gated architectures
- Fused add-norm (residual + normalize in one pass)
- Fused element-wise operation chains across all backends
- i8×i8→i32 and FP8 quantized matmul paths
- 2:4 structured sparsity with multi-backend support
- slice_assign indexing operation
- Seeded deterministic RNG
- Expanded autograd differentiable op coverage
- CUDA caching allocator and GEMV fast paths
…kward kernel

Platform-specific floating-point edge cases in SiLU and Tanh derivative
computation could produce NaN or Inf on Windows CI, propagating non-finite
gradients through the backward pass. Guard against this by replacing any
non-finite derivative value with zero before accumulating into the gradient.
farhan-syah merged commit 671337e into main on Mar 14, 2026
11 checks passed
farhan-syah deleted the 0.5.0 branch on March 15, 2026 at 00:37


Development

Successfully merging this pull request may close these issues.

0.5.0: Fused ops, FP8 compute, 2:4 sparsity, autograd expansion, production hardening
