
v0.5.0: fused ops, FP8 compute, 2:4 sparsity, autograd expansion#6

Merged
farhan-syah merged 132 commits into main from 0.5.0
Mar 14, 2026

Conversation

@farhan-syah
Collaborator

Summary

numr 0.5.0 — 131 commits, 875 files changed, +85k/-28k lines.

Fused Operations

  • Fused GEMM epilogue: matmul+bias+activation in a single kernel (forward + backward)
  • Fused activation-mul for gated architectures (SwiGLU, SiLU-mul)
  • Fused add-norm: residual add + normalization in one pass (forward + backward)
  • Fused elementwise operation chains across all backends

FP8 & Quantized Compute

  • FP8 (E4M3/E5M2) matmul across all backends
  • FP8 kernel support across CUDA compute paths
  • i8×i8→i32 quantized matrix multiplication (CPU)

Sparse

  • 2:4 structured sparsity with multi-backend support

Autograd Expansion

  • Differentiable conv1d, conv2d, softmax, rms_norm, layer_norm, SiLU, softplus, SwiGLU, dropout, fused GEMM epilogue, fused add-norm, dtype cast, narrow, cat, gather

Performance

  • CUDA caching allocator (replaces stream-ordered alloc)
  • CUDA pipelined D2H copy for concurrent execution
  • GEMV-BT fast paths across CPU/CUDA/WebGPU (inference-critical)
  • Online softmax in SIMD kernels
  • Welford algorithm for numerically stable variance
  • AVX2 transcendental/special function SIMD kernels
  • Tiled GEMM with dual-accumulator FMA microkernels (AVX2/AVX-512/NEON)
  • Half-precision GEMV-BT acceleration (f16/bf16)
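The Welford update mentioned above can be sketched in a few lines. This is an illustrative scalar version under assumed conventions, not numr's SIMD kernel:

```rust
// Welford's single-pass variance: track the running mean and the sum of
// squared deviations (m2) so large offsets don't cancel catastrophically,
// unlike the naive E[x^2] - E[x]^2 formula.
fn welford_variance(xs: &[f64]) -> f64 {
    let (mut mean, mut m2) = (0.0, 0.0);
    for (n, &x) in xs.iter().enumerate() {
        let delta = x - mean;
        mean += delta / (n as f64 + 1.0);
        m2 += delta * (x - mean); // uses mean both before and after the update
    }
    m2 / xs.len() as f64 // population variance
}

fn main() {
    // A huge offset would wreck the sum-of-squares formula in f64;
    // Welford still recovers the variance of {4, 7, 13, 16}, which is 22.5.
    let xs: Vec<f64> = [4.0, 7.0, 13.0, 16.0].iter().map(|&x| x + 1e9).collect();
    assert!((welford_variance(&xs) - 22.5).abs() < 1e-6);
    println!("ok");
}
```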

Runtime & Infrastructure

  • CUDA graph capture support
  • NCCL communicator for multi-GPU collectives
  • Nexar inter-node communicator
  • Seeded deterministic RNG across all backends
  • Internal RNG (removed external rand/rand_distr dependency)
  • Slice assign operation across all backends
  • Streaming sync ops for compute-communication overlap

Architecture

  • Runtime::DType associated type (tensors generic over runtime dtype)
  • CPU backend made unconditional
  • Backward pass accumulation in precision-appropriate float type
  • Static WGSL shaders replacing runtime generation
  • Extensive file splits and refactoring

Fixes

  • aarch64 NEON: replaced non-existent vmvnq_u64 with correct bitwise NOT
  • Softmax NaN prevention for -inf inputs
  • Contiguity check for size-1 dim strides
  • CUDA graph capture allocator freeze/unfreeze
  • Batched matmul broadcasting across all backends

Test plan

  • cargo test passes (all platforms)
  • cargo test --features f16,sparse passes
  • cargo test --features wgpu passes
  • Backend parity tests (CPU vs CUDA vs WebGPU)
  • cargo publish --dry-run succeeds
  • CI: Ubuntu, macOS (aarch64), Windows

Add Hash trait to Layout, Shape, and Strides to enable their use in
hash-based collections. Fix contiguity check to correctly identify
strided views that maintain row-major order regardless of offset.

Extend Layout API with methods for common tensor operations:
- Transpose operations (t, transpose_axes)
- Dimension manipulation (squeeze_dim, squeeze_all, unsqueeze_at)
- Flattening and permutation (flatten, permute_dims)
- Advanced indexing (as_strided, index_to_offset, offset_to_index)
- Broadcasting utilities (broadcast_shape, broadcast_shapes)
- Storage calculations (storage_size)
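As an illustration of the broadcasting utilities listed above, a NumPy-style shape-broadcast rule can be sketched as follows. The signature is hypothetical, not numr's actual `broadcast_shape` API:

```rust
// Align trailing dimensions; a missing or size-1 dimension broadcasts
// against the other operand's size, anything else is incompatible.
fn broadcast_shape(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let ndim = a.len().max(b.len());
    let mut out = vec![0; ndim];
    for i in 0..ndim {
        let da = if i < ndim - a.len() { 1 } else { a[i - (ndim - a.len())] };
        let db = if i < ndim - b.len() { 1 } else { b[i - (ndim - b.len())] };
        out[i] = match (da, db) {
            (x, y) if x == y => x,
            (1, y) => y,
            (x, 1) => x,
            _ => return None, // incompatible shapes
        };
    }
    Some(out)
}

fn main() {
    assert_eq!(broadcast_shape(&[8, 1, 3], &[4, 3]), Some(vec![8, 4, 3]));
    assert_eq!(broadcast_shape(&[2, 3], &[4, 3]), None);
    println!("ok");
}
```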

Add From trait implementations for ergonomic layout construction from
tuples, arrays, and slices up to 6 dimensions.
Increment minor version to reflect new tensor layout API features
including Hash trait implementations and comprehensive dimension
manipulation methods.
…rait

Split monolithic dtype/mod.rs into focused modules for better maintainability
and extensibility. Introduces DataType trait to enable downstream libraries
like boostr to define custom dtype enums with quantized variants while
maintaining compatibility with numr's core tensor operations.
Enables runtimes to specify their dtype enum through an associated type,
allowing downstream libraries to extend numr with custom quantized types
while maintaining type safety and backend compatibility.
Updates Tensor to use Runtime's associated DType instead of hardcoded numr::DType,
enabling extensibility for downstream libraries. Reorganizes tensor factory methods
to separate generic DataType operations from concrete DType-specific constructors,
improving code organization and reducing duplication.
Propagates Runtime<DType = DType> bounds throughout operation traits,
implementation helpers, and shape utilities to support the new
extensible dtype system while maintaining backward compatibility.
Propagates dtype trait bounds through linear algebra and polynomial algorithms,
maintaining consistency with the new extensible type system for tensor decomposition,
polynomial operations, and FFT-based convolutions.
Propagates dtype trait bounds through gradient computation and variable
operations, ensuring type safety in automatic differentiation with the
extensible dtype system.
Updates test utilities and backend parity checks to work with the new
DataType trait, ensuring comprehensive validation across CPU, CUDA,
and WebGPU backends with the extensible dtype architecture.
Introduce AllocationStats for profiling allocator behavior and
TrackingAllocator<A> — a generic wrapper that layers thread-safe
tracking on top of any Allocator implementation.

TrackingAllocator records total allocations, total bytes, active
allocation count, peak memory usage (high-water mark), and frozen
state. Cloning shares the same Arc<Mutex<...>> state so that all
handles observe the same counters.

Two new error variants support the allocator lifecycle:
- AllocatorBusy: reset rejected while live allocations exist
- AllocatorFrozen: new allocations rejected while frozen

The Allocator trait gains two defaulted methods:
- stats() -> AllocationStats (zeroed default for non-tracking impls)
- reset() -> Result<()> (no-op default)

Tests cover: basic stat tracking, allocated_bytes(), freeze/unfreeze,
reset success, reset-while-busy rejection, peak across cycles, clone
state sharing, and freeze preservation through reset.
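The shared-state tracking described above can be sketched with a minimal counter wrapper. Names and fields here are illustrative, not TrackingAllocator's real API:

```rust
use std::sync::{Arc, Mutex};

// Thread-safe allocation stats with a high-water mark; clones share
// the same Arc<Mutex<..>> so every handle observes the same counters.
#[derive(Default, Clone, Copy, Debug, PartialEq)]
struct AllocationStats {
    active_bytes: usize,
    peak_bytes: usize,
    total_allocs: usize,
}

#[derive(Clone, Default)]
struct Tracker(Arc<Mutex<AllocationStats>>);

impl Tracker {
    fn on_alloc(&self, bytes: usize) {
        let mut s = self.0.lock().unwrap();
        s.active_bytes += bytes;
        s.total_allocs += 1;
        s.peak_bytes = s.peak_bytes.max(s.active_bytes); // high-water mark
    }
    fn on_free(&self, bytes: usize) {
        self.0.lock().unwrap().active_bytes -= bytes;
    }
    fn stats(&self) -> AllocationStats {
        *self.0.lock().unwrap()
    }
}

fn main() {
    let t = Tracker::default();
    let handle = t.clone(); // shares state with t
    t.on_alloc(100);
    t.on_alloc(50);
    t.on_free(100);
    let s = handle.stats();
    assert_eq!((s.active_bytes, s.peak_bytes, s.total_allocs), (50, 150, 2));
    println!("ok");
}
```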
Add a set of commonly needed methods to Tensor<R> that reduce
boilerplate in downstream code.

Ergonomic aliases for existing accessors:
- rank()        -> ndim() alias
- elem_count()  -> numel() alias
- dims()        -> shape() alias returning &[usize]
- len()         -> numel() alias for Iterator/slice parity
- is_empty()    -> true when numel() == 0

Typed dimension access:
- dim(index: isize) -> Result<usize>, negative-index aware
- dims1() through dims5() unpack shape into typed tuples,
  returning ShapeMismatch when the rank does not match

Low-level storage inspection:
- offset()             -> layout offset in elements
- ptr()                -> raw base storage pointer
- data_ptr()           -> ptr + offset * dtype_size (first element)
- owns_memory()        -> whether storage deallocates on drop
- shares_storage_with() -> true when two tensors share a buffer
- ref_count()          -> storage Arc reference count

Construction helper:
- from_storage_contiguous(storage, shape) builds a Tensor directly
  from a Storage handle without going through a client

Deep copy:
- to_bytes()   -> materializes tensor data as raw bytes (contiguous first)
- clone_deep() -> full copy with independent storage
Introduce a Graph trait for capturing and replaying computation sequences,
backed by CUDA Graphs on the CUDA runtime and a no-op eager path on CPU
and WebGPU.

- Add Graph trait with launch() and is_replay_capable() to src/runtime/graph.rs
- Add NoOpGraph for CPU and WebGPU (operations execute eagerly during capture)
- Add CudaGraph wrapping cudarc's CudaGraph behind Arc<Mutex<>> for Send + Sync
- Add Runtime::Graph as a new associated type on the Runtime trait
- Add Runtime::capture_graph() as a required method replacing the stub
- Implement capture_graph() on CpuRuntime (eager), WgpuRuntime (eager), and
  CudaRuntime (real stream capture via cudarc begin_capture/end_capture)
- CUDA implementation correctly ends capture even when the closure fails so
  the stream is never left in capture mode
- Add unit tests for CPU eager execution, error propagation, and NoOpGraph
- Update MockRuntime in external_backend_api.rs to satisfy the new trait bound
Replace bare R: Runtime bounds with R: Runtime<DType = DType> in all
sites that work directly with DType values. This eliminates implicit
assumptions about the associated type and makes each function's
requirements explicit to the type checker.

Affected sites:
- fallback.rs: validate_binary_dtypes, compute_broadcast_shape, all
  fallback op helpers (binary, unary, scalar, reduce, activation,
  softmax, matmul, compare, where_cond, csc/coo elementwise)
- statistics_common.rs: skew_composite, kurtosis_composite
- impl_generic/linalg.rs: triangular_mask_impl, triu_impl, tril_impl,
  slogdet_impl
- impl_generic/utility.rs: one_hot_impl

Also remove an unconditional TypeConversionOps import in cuda/random.rs
that is only needed under the fp8 feature flag, and drop an unused
TypeConversionOps import in cuda/linalg/statistics.rs.
Adds integration tests verifying F16, BF16, FP8E4M3, and FP8E5M2
support across all ML-critical CPU operations: binary, scalar, unary,
reduce, matmul, activations, and normalizations.

Each dtype is audited end-to-end including round-trip casts from F32,
with per-operation pass/fail reporting and a summary assertion to catch
regressions in reduced-precision coverage.
…munication

Introduces a runtime-level abstraction for collective and point-to-point
communication across devices, supporting distributed FFT, parallel linear
algebra, Monte Carlo simulations, and gradient synchronization.

- `Communicator` trait with allreduce, broadcast, allgather, reducescatter,
  and point-to-point send/recv operations over raw device pointers
- `ReduceOp` enum covering Sum, Prod, Min, Max reductions
- `NoOpCommunicator` for single-device operation (world_size=1):
  in-place collectives are true no-ops, separate-buffer collectives
  perform a memcpy, point-to-point ops are no-ops
- Re-export `Communicator`, `NoOpCommunicator`, and `ReduceOp` from
  `runtime` public API
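The single-device no-op behavior described above can be sketched with a simplified trait. The shape is illustrative; numr's real `Communicator` operates over raw device pointers:

```rust
// With world_size = 1, an in-place allreduce of a single rank's
// contribution is the contribution itself, so the op is a true no-op.
trait Communicator {
    fn world_size(&self) -> usize;
    fn allreduce_sum_inplace(&self, buf: &mut [f32]);
}

struct NoOpCommunicator;

impl Communicator for NoOpCommunicator {
    fn world_size(&self) -> usize {
        1
    }
    fn allreduce_sum_inplace(&self, _buf: &mut [f32]) {
        // nothing to reduce against
    }
}

fn main() {
    let comm = NoOpCommunicator;
    let mut grads = vec![1.0f32, 2.0, 3.0];
    comm.allreduce_sum_inplace(&mut grads);
    assert_eq!(grads, vec![1.0, 2.0, 3.0]); // unchanged
    println!("ok");
}
```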
Replace direct `.unwrap()` on Mutex::lock() calls with a private `lock()`
helper that recovers from a poisoned lock via `into_inner()`. A poisoned
lock means another thread panicked while holding it; the tracking counters
may be inconsistent but the inner allocator remains usable, making recovery
safer than propagating a panic to the caller.
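A poison-recovering lock helper of this kind can be sketched as follows (illustrative, assuming a free-standing helper rather than numr's private method):

```rust
use std::sync::{Arc, Mutex, MutexGuard};

// If another thread panicked while holding the lock, the Mutex is
// poisoned; take the inner guard anyway instead of propagating the panic.
fn lock<T>(m: &Mutex<T>) -> MutexGuard<'_, T> {
    m.lock().unwrap_or_else(|poisoned| poisoned.into_inner())
}

fn main() {
    let counter = Arc::new(Mutex::new(0u32));
    let c = Arc::clone(&counter);
    // Poison the mutex by panicking while the guard is held.
    let _ = std::thread::spawn(move || {
        let _g = c.lock().unwrap();
        panic!("poison the lock");
    })
    .join();
    // A plain .lock().unwrap() would panic here; the helper recovers.
    *lock(&counter) += 1;
    assert_eq!(*lock(&counter), 1);
    println!("ok");
}
```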
All top-level items in ml_dtype_audit.rs are now guarded with
#[cfg(any(feature = "f16", feature = "fp8"))] so the test file compiles
cleanly without those optional features enabled.
Add `as_host_slice` and `as_host_slice_mut` unsafe methods to `Storage<R>`
that return borrowed slices into CPU-backed memory without allocating. Both
methods short-circuit on empty storage and document the safety invariants
required of callers (valid host pointer, no aliasing for the mutable variant).
Implement NarrowBackward and CatBackward gradient functions, enabling
autograd to propagate gradients through tensor slicing and concatenation.

NarrowBackward pads the incoming gradient with zeros to restore the
original shape along the narrowed dimension. CatBackward splits the
output gradient back into per-input slices using narrow, reversing the
concatenation exactly.

Export var_narrow and var_cat from the autograd crate root alongside the
existing shape op exports.
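The two gradient rules can be illustrated in one dimension (a sketch only; the real ops work on tensors along an arbitrary dimension):

```rust
// narrow's backward zero-pads the incoming gradient back to the original
// length: positions outside the narrowed window received no gradient.
fn narrow_backward(grad: &[f32], orig_len: usize, start: usize) -> Vec<f32> {
    let mut out = vec![0.0; orig_len];
    out[start..start + grad.len()].copy_from_slice(grad);
    out
}

// cat's backward splits the output gradient at the input boundaries,
// i.e. a sequence of narrows that exactly reverses the concatenation.
fn cat_backward(grad: &[f32], input_lens: &[usize]) -> Vec<Vec<f32>> {
    let mut pieces = Vec::new();
    let mut offset = 0;
    for &len in input_lens {
        pieces.push(grad[offset..offset + len].to_vec());
        offset += len;
    }
    pieces
}

fn main() {
    assert_eq!(narrow_backward(&[1.0, 2.0], 5, 1), vec![0.0, 1.0, 2.0, 0.0, 0.0]);
    assert_eq!(
        cat_backward(&[1.0, 2.0, 3.0], &[1, 2]),
        vec![vec![1.0], vec![2.0, 3.0]]
    );
    println!("ok");
}
```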
In CudaGraph::launch, recover from a poisoned mutex rather than
panicking, consistent with the existing TrackingAllocator fix.

In Storage::as_host_slice_mut, change the receiver from &self to &mut
self so the mutable slice borrow is sound — a mutable slice must come
from exclusive access to the backing storage.
Add slice_assign to IndexingOps, which copies a source tensor into a
contiguous slice of a destination tensor along a given dimension starting
at a specified index, returning a new tensor with the region replaced.

Implemented natively on all three backends:

- CPU: pointer-based kernel that copies dst then overwrites the slice
  region with src using dispatch_dtype
- CUDA: PTX kernel instantiated for all supported dtypes (f32, f64,
  f16, bf16, i32, i64, fp8_e4m3, fp8_e5m2) via the existing
  launch_slice_assign launcher
- WebGPU: WGSL compute shader generated per dtype (f32, i32, u32) with
  a SliceAssignParams uniform; get_buffer is widened to pub to support
  the bind group wiring

Expose the operation on Tensor<R> via Tensor::slice_assign for
ergonomic use at the call site.
…tives

Implement NcclCommunicator wrapping cudarc's nccl::Comm to satisfy the
Communicator trait for CUDA multi-GPU workloads. Supports all_reduce,
broadcast, all_gather, reduce_scatter, send, recv, sync, and barrier.

DType dispatch is handled via raw nccl::result FFI to avoid compile-time
NcclType generic constraints, covering F32, F64, F16, BF16, FP8E4M3,
FP8E5M2, I32, I64, I8, U32, and U8. A new nccl feature flag chains the
cuda feature and cudarc's nccl feature behind a single opt-in gate.
NcclCommunicator is re-exported from the runtime crate root when the
flag is active.
Implement var_rms_norm and var_layer_norm with full gradient support
for the autograd system. Both operations use the fused NormalizationOps
kernel for the forward pass and compute numerically stable gradients
in the backward pass.

RMS norm gradients account for the interaction between input and weight
via the rstd and x_norm tensors recomputed from saved inputs. Layer norm
gradients additionally handle the bias term and subtract the mean of
the scaled gradient to satisfy the zero-sum constraint over the
normalized dimension.

Both var_backward and backward_var paths are implemented, enabling
higher-order gradient computation through normalization layers.
Introduce NexarNetCommunicator, which implements the Communicator trait
using nexar::SyncClient as the transport layer. This enables inter-node
collective operations (allreduce, broadcast, all_gather, reduce_scatter,
send, recv, barrier) over QUIC without requiring NCCL or any GPU-specific
infrastructure.

The implementation is gated behind the nexar feature flag and is intended
for CPU-to-CPU inter-node gradient synchronization and tensor parallelism.
For intra-node GPU-GPU traffic, NcclCommunicator remains the right choice
given NVLink and PCIe bandwidth advantages.

DType and ReduceOp mappings cover F32, F64, F16, BF16, integer types,
and reject unsupported types with a clear error.
…inter

Previously the Tensor API had two pointer accessors: ptr() which returned
the raw base storage address, and data_ptr() which returned the
offset-adjusted pointer to the first element of the tensor view.

This caused widespread confusion where call sites used storage().ptr()
instead of data_ptr() and therefore silently operated on the wrong memory
address for non-zero-offset views (slices, transposes).

Remove data_ptr() and redefine ptr() to always return the offset-adjusted
pointer. Update all call sites across ops, runtime helpers, kernels, and
sparse operations to use the unified ptr() accessor.
Add two composite activation operations following the impl_generic pattern:

log_softmax: computed as x - logsumexp(x, dim) for numerical stability.
Implemented in impl_generic/activation.rs and delegated by all three
backends (CPU, CUDA, wgpu). Includes LogSoftmaxBackward grad function
in the autograd system and var_log_softmax for traced computation.

dropout: randomly zeros elements with probability p during training and
scales remaining elements by 1/(1-p). Returns input unchanged during
inference. Implemented in impl_generic and delegated by all backends.

Both operations are exposed via Tensor convenience methods (log_softmax,
dropout) and tested with unit tests covering standard cases, edge cases
(p=0, p=1), and gradient correctness.
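The x - logsumexp(x) formulation can be sketched in scalar form, with the usual max-subtraction inside logsumexp for stability (illustrative, not numr's kernel):

```rust
// log_softmax(x) = x - logsumexp(x). Subtracting the max before
// exponentiating keeps exp() in range even for very large inputs.
fn log_softmax(x: &[f64]) -> Vec<f64> {
    let m = x.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let lse = m + x.iter().map(|v| (v - m).exp()).sum::<f64>().ln();
    x.iter().map(|v| v - lse).collect()
}

fn main() {
    // Inputs of 1000 would overflow a naive exp-then-log path.
    let y = log_softmax(&[1000.0, 1000.0]);
    assert!((y[0] - (-std::f64::consts::LN_2)).abs() < 1e-12);

    // exp(log_softmax(x)) is a probability distribution: it sums to 1.
    let z = log_softmax(&[0.1, 0.2, 0.3]);
    let total: f64 = z.iter().map(|v| v.exp()).sum();
    assert!((total - 1.0).abs() < 1e-12);
    println!("ok");
}
```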
…lvers

The iterative solver helpers (vector_norm, vector_dot, update_solution,
accumulate_basis_combination, extract_diagonal_inv) and their callers in
all GMRES variants, CG, BiCGSTAB, CGS, QMR, MINRES, Lanczos, Arnoldi,
Jacobi, SOR, SVDS, AMG, and sparse LU decompositions were using an
unconstrained R: Runtime bound.

These functions extract scalar values via item() which requires the runtime
to use the standard DType. Tighten the bound to R: Runtime<DType = DType>
to make this requirement explicit and prevent misuse with non-standard
runtime type parameters.
Implement var_silu and SiluBackward for differentiable SiLU (Swish)
support in the autograd system. The gradient uses the numerically
stable form: sigmoid(x) * (1 + x - silu(x)), avoiding a redundant
sigmoid computation by reusing the saved forward output.

Also promote ActivationOps from a test-only import to a full import
in the activation backward module, since SiluBackward requires it
unconditionally.
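The gradient identity quoted above can be checked numerically against a central finite difference (a scalar sketch, not the autograd code):

```rust
// d/dx silu(x) = sigmoid(x) * (1 + x - silu(x)):
// expanding silu(x) = x * sigmoid(x) and differentiating recovers this
// form, which reuses the forward output instead of a second sigmoid.
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}
fn silu(x: f64) -> f64 {
    x * sigmoid(x)
}
fn silu_grad(x: f64) -> f64 {
    sigmoid(x) * (1.0 + x - silu(x))
}

fn main() {
    for &x in &[-3.0, -0.5, 0.0, 0.5, 3.0] {
        let h = 1e-6;
        let fd = (silu(x + h) - silu(x - h)) / (2.0 * h); // central difference
        assert!((silu_grad(x) - fd).abs() < 1e-6);
    }
    println!("ok");
}
```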
Implement softplus — log(1 + exp(x)) — across the full stack:

- `ActivationOps::softplus` trait method with a default NotImplemented body
- `softplus_impl` in impl_generic using the numerically stable form
  `relu(x) + log(1 + exp(-|x|))` to avoid overflow for large positive inputs
- CPU, CUDA, and WebGPU backends delegate to softplus_impl
- `var_softplus` autograd op with `SoftplusBackward` gradient node;
  backward computes sigmoid(x), which is the exact derivative
- Tests covering zero, non-zero, large positive/negative, batched input,
  and non-unit upstream gradients
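The stable formulation can be demonstrated in scalar form (a sketch; the real implementation lives in impl_generic and the backend kernels):

```rust
// relu(x) + ln(1 + exp(-|x|)) equals ln(1 + exp(x)) algebraically but
// never exponentiates a large positive value, so it cannot overflow.
fn softplus_naive(x: f64) -> f64 {
    (1.0 + x.exp()).ln()
}
fn softplus_stable(x: f64) -> f64 {
    x.max(0.0) + (1.0 + (-x.abs()).exp()).ln()
}

fn main() {
    // Agrees with the naive form wherever the naive form is representable...
    for &x in &[-5.0, -0.1, 0.0, 0.1, 5.0] {
        assert!((softplus_naive(x) - softplus_stable(x)).abs() < 1e-12);
    }
    // ...and stays finite where exp(x) overflows to infinity.
    assert!(softplus_naive(1000.0).is_infinite());
    assert_eq!(softplus_stable(1000.0), 1000.0);
    println!("ok");
}
```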
…nd group support

Expand the flat communicator.rs + nexar_communicator.rs into a proper
module directory, separating concerns across dedicated files:

- traits.rs: Communicator trait and ReduceOp enum
- noop.rs: NoOpCommunicator for single-device operation
- nexar.rs: NexarNetCommunicator for inter-node QUIC transport
- nexar_compat.rs: dtype/op mapping helpers for nexar integration
- group.rs: CommunicatorGroup and ParallelDim for tensor/pipeline parallelism
- hierarchical.rs: HierarchicalCommunicator combining intra-node NCCL
  with inter-node nexar for optimal bandwidth utilization

Replace the coarse-grained `nexar` feature flag with two finer-grained
flags: `distributed` (nexar QUIC transport + tokio runtime) and
`distributed-gpu` (distributed + NCCL for intra-node GPU collectives).
Add nexar-nccl and tokio as optional dependencies accordingly.
Add `rand_seeded(shape, dtype, seed)` to `RandomOps` for reproducible
random number generation. Calling with the same seed and shape always
produces the same tensor, enabling deterministic initialization and
testing.

- Trait: default impl returns `NotImplemented` for graceful degradation
- CPU: uses xoshiro256 uniform kernel, all float dtypes supported
- CUDA: launches existing rand kernel with explicit seed, FP8 via F32 cast
- WebGPU: seed truncated to u32 (WGSL has no native u64); determinism preserved
- Tests: reproducibility verified on all three backends; range check [0, 1)
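The determinism contract can be illustrated with a small counter-seeded generator. This sketch uses splitmix64, not numr's xoshiro256 kernel:

```rust
// splitmix64: a tiny, well-known PRNG step function. Same seed in,
// same sequence out -- the property rand_seeded guarantees.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

// Map the top 53 bits to a uniform f64 in [0, 1).
fn uniform01(state: &mut u64) -> f64 {
    (splitmix64(state) >> 11) as f64 / (1u64 << 53) as f64
}

fn main() {
    let (mut s1, mut s2) = (42u64, 42u64);
    let a: Vec<f64> = (0..4).map(|_| uniform01(&mut s1)).collect();
    let b: Vec<f64> = (0..4).map(|_| uniform01(&mut s2)).collect();
    assert_eq!(a, b); // reproducible for equal seeds
    assert!(a.iter().all(|v| (0.0..1.0).contains(v))); // range check [0, 1)
    println!("ok");
}
```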
…ual accumulators

Extract AVX-512 and AVX2+FMA dot product paths into dedicated
`#[target_feature]`-annotated functions so the compiler can optimize
each function body fully for its ISA without runtime branching overhead.

Both paths now use two independent FMA accumulators interleaved, hiding
the 4-5 cycle FMA latency on modern x86 and doubling effective throughput
for the GEMV-BT inner loop.
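The dual-accumulator pattern can be shown in scalar form: two independent sums break the serial dependency between consecutive fused multiply-adds so the FMA units can overlap (illustrative only, not the SIMD kernels):

```rust
// Even and odd elements feed separate accumulators; each iteration's
// two mul_adds depend only on their own chain, not on each other.
fn dot_dual_acc(a: &[f32], b: &[f32]) -> f32 {
    let (mut acc0, mut acc1) = (0.0f32, 0.0f32);
    for (ca, cb) in a.chunks_exact(2).zip(b.chunks_exact(2)) {
        acc0 = ca[0].mul_add(cb[0], acc0); // dependency chain 0
        acc1 = ca[1].mul_add(cb[1], acc1); // chain 1, independent of chain 0
    }
    // Scalar tail for odd lengths, then combine the partial sums once.
    let tail: f32 = a
        .chunks_exact(2)
        .remainder()
        .iter()
        .zip(b.chunks_exact(2).remainder())
        .map(|(x, y)| x * y)
        .sum();
    acc0 + acc1 + tail
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0, 5.0];
    let b = [5.0, 4.0, 3.0, 2.0, 1.0];
    assert_eq!(dot_dual_acc(&a, &b), 35.0);
    println!("ok");
}
```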
…re annotations

Add NEON implementations for gemv_bt_f32 and gemv_bt_f64 on aarch64,
processing 4 output columns at a time with vfmaq_f32/vfmaq_f64 FMA
instructions. The f32 path unrolls the inner loop 4-wide for better
throughput; the f64 path uses dual accumulators to avoid RAW stalls.

Extract batch_bf16_to_f32 and batch_f16_to_f32 SIMD inner loops into
dedicated functions annotated with #[target_feature(enable = "avx2")] and
#[target_feature(enable = "f16c", enable = "avx")] respectively, with
explicit scalar fallbacks. This ensures Rust emits the correct target
feature guards and prevents UB from calling AVX instructions on CPUs that
do not support them.

Simplify the AVX-512 i8xi8 dot-product dispatch: SimdLevel::Avx512 is
only set when avx512bw is confirmed available, so the redundant
is_x86_feature_detected! guard inside the match arm is removed.
…duction

Replace single-accumulator loops in the variance phase of fused layer norm
and fused RMS norm (AVX2 and AVX512, forward and backward passes, f32 and
f64) with a dual-accumulator pattern that processes two SIMD vectors per
iteration. Combining the two partial sums with a single vector add at the
end allows out-of-order CPUs to issue two independent FMA chains in
parallel, eliminating the accumulator RAW dependency that previously
serialized throughput to one vector per cycle.
Replace the manual transmute(0u32) no-flags workaround with the proper
CUDA_GRAPH_INSTANTIATE_FLAG_AUTO_FREE_ON_LAUNCH constant. Graph-managed
memory allocated during capture is freed on each launch, requiring
callers to copy output tensors before the next launch.

Update the comment to accurately describe the memory lifecycle instead
of the previous (incorrect) rationale that justified suppressing the
flag to preserve stable device pointers across replays.
Split the monolithic conv.rs into conv1d.rs, conv2d.rs, and conv_common.rs
to follow the one-operation-per-file rule. Adds var_conv2d with full
backward support (d_input via transposed convolution, d_weight via
cross-correlation, d_bias via sum over batch and spatial dims).
Introduce src/runtime/cpu/kernels/rng.rs as numr's own PRNG and
distribution sampler, removing the rand and rand_distr crate
dependencies from Cargo.toml. All distribution kernels
(distributions.rs, memory.rs, quasirandom.rs) now call into this
internal module instead of directly using rand APIs.
Remove RandomOps from the TensorOps supertrait bound so random
operations are opt-in rather than required by the core tensor
interface. Group random op traits (RandomOps, AdvancedRandomOps,
QuasiRandomOps, MultivariateRandomOps) into a dedicated re-export
block in ops/mod.rs and lib.rs prelude. Fix var_dropout to be
exported as a standalone item in autograd/mod.rs, and update the
import in tensor_decompose_core.rs accordingly.
…ckends

When one operand has a batch dimension of 1, its offset must stay fixed
while the other operand advances through its batches. Previously both
offsets were incremented unconditionally, so broadcasting a single matrix
against a batch produced wrong results.

Fix adds per-operand batch counts (a_batch / b_batch) derived from each
input's own shape. CPU paths use conditional offset selection; CUDA kernels
receive the two counts as extra parameters and compute offsets via modulo,
which handles both symmetric and asymmetric broadcast cases uniformly.

Affected paths: CPU matmul, CPU semiring_matmul, CUDA matmul_batched,
CUDA matmul_bias_batched, CUDA semiring_matmul_batched, and all CUDA GEMV
variants (gemv, gemv_bt, gemv_bt_mr).
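The modulo-offset scheme can be sketched as follows: each operand advances through its own batch count, so a batch-1 operand stays pinned at offset 0 while the other side iterates (illustrative helper, not the kernel code):

```rust
// For output batch i, operand offsets are (i mod that operand's batch
// count) times its per-batch stride -- uniform for both the symmetric
// case and the broadcast case.
fn batch_offsets(
    out_batches: usize,
    a_batch: usize,
    b_batch: usize,
    a_stride: usize,
    b_stride: usize,
) -> Vec<(usize, usize)> {
    (0..out_batches)
        .map(|i| ((i % a_batch) * a_stride, (i % b_batch) * b_stride))
        .collect()
}

fn main() {
    // Broadcasting one matrix (a_batch = 1) against 3 batches of b:
    // a stays at offset 0, b advances by its stride each batch.
    assert_eq!(batch_offsets(3, 1, 3, 6, 8), vec![(0, 0), (0, 8), (0, 16)]);
    // Symmetric case: both sides advance in lockstep.
    assert_eq!(batch_offsets(2, 2, 2, 6, 8), vec![(0, 0), (6, 8)]);
    println!("ok");
}
```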
…tions

All public and pub(crate) unsafe fn declarations in the CUDA sparse kernel
modules were missing # Safety documentation required by clippy's
missing_safety_doc lint. Add precise safety contracts covering device memory
validity, element count requirements, index range constraints, and stream
lifetime rules for each launcher.
Move DType from the module-level use into the cfg-conditional blocks where it
is actually referenced, eliminating unused-import warnings on non-SIMD targets.
Replaces the monolithic reduce.rs (1025 lines) with a focused directory:
- common.rs: shared helpers (ensure_contiguous, broadcast utilities)
- sum_mean.rs: SumBackward, MeanBackward
- extremum.rs: MaxBackward, MinBackward
- statistical.rs: remaining statistical reduction gradients
…odules

Replaces two large monolithic launcher files with per-operation directories:
- index/: gather, scatter, index_select, masked, slice_assign, embedding
- sparse_merge/: csr, csc, generic, helpers

Each module stays within the 500-line file size limit.
Extends CUDA kernels to handle FP8E4M3 and FP8E5M2 dtypes with F32
accumulation throughout:

- fused_add_norm: FP8 fused add+RMSNorm/LayerNorm forward and backward
  with atomicCAS-based FP8 atomic accumulation for weight gradients
- fused_elementwise: FP8 fused_mul_add, fused_add_mul, fused_mul_add_scalar
- distance: FP8 cdist/pdist via AccType<fp8> → float specializations
- semiring_matmul: F16, BF16, FP8 semiring kernels (compute in F32)
- ternary: FP8 instantiations for ternary select kernels
- utility: native F16/BF16/FP8 fill values and FP8 arange/linspace support
- cpu/activation: simplify GELU to use tanh op directly, avoiding
  manual exp-based tanh that overflows in low-precision dtypes
- cpu/distance: cdist/pdist promote FP8 inputs to F32 for computation
- cuda/gemm_epilogue: FP8 matmul_bias and matmul_bias_residual promote
  to F32 (tiled GEMM shared-memory path requires native arithmetic)
- cuda/normalization: fused_add_layer_norm_bwd promotes FP8 to F32 to
  avoid precision loss in multi-pass backward with atomic accumulation
- cuda/semiring_matmul: allow F16, BF16, FP8 through dtype validation
- ops/semiring: fix dtype check logic to correctly return true for
  F16/BF16/FP8 under their respective feature flags
…nings

- Loosen FP8E4M3 tolerance to rtol=0.3/atol=2.5 to accommodate rounding
  error accumulation in compound ops (norm backward, GEMM)
- Prefix unused result bindings with _ in conditional and distance tests
…functions

Implement AVX2-vectorized kernels for exp/log, trigonometric functions,
hyperbolic functions, reductions, and special functions (erf, gamma, Bessel).
Each kernel follows the #[target_feature(enable = "avx2")] pattern with
dual accumulators where applicable to hide FMA pipeline latency.
Remove overly specific patch version pins from nexar, nexar-nccl, and
paste dependencies, using minor-version constraints instead to allow
compatible patch updates.
The import is only used in FP8 code paths, so it should not be
unconditionally present. This resolves the unused import warning on
non-fp8 builds.
The cpu feature is enabled by default, so passing --features cpu alongside
--no-default-features was contradictory. The checks now correctly validate
compilation with no features active.
…ess issues

Replace vmvnq_u64 with veorq_u64(..., !0) in the NEON softmax kernel since
vmvnq_u64 is not available in stable aarch64 intrinsics. Remove exhaustive
catch-all arms from match expressions in the unary and special kernels that
were unreachable after full variant coverage was added. Prefix unused
intermediate NEON reduction variables with underscore to suppress dead-code
warnings in cumulative and index kernels. Gate x86_64 microkernel macros and
SimdLevel imports behind #[cfg(target_arch = "x86_64")] to avoid unused-import
warnings on non-x86 targets. Add #[allow(unreachable_code)] to the scalar
SIMD fallback path. Fix Vec type annotation in reduce test to satisfy clippy.
… float type

Replace raw f64 casts in the GEMM epilogue backward kernel with a generic
AccFloat trait dispatched at runtime. F64 tensors accumulate in f64; all
sub-f32 types (F16, BF16) and F32 accumulate in f32, matching standard ML
framework practice and avoiding unnecessary precision loss on the hot path.
…ty assert

The assertion was referencing `cpu_result` instead of `_cpu_result`, causing
a compilation warning and referencing the wrong binding in the WebGPU vs CPU
comparison for the where_cond test.
Bump actions/checkout and actions/cache from v4 to v5 across all
workflow files (baseline, benchmark, release, test).
Add coverage for features shipped in 0.5.0:
- Fused GEMM epilogue (matmul+bias+activation, forward+backward)
- Fused activation-mul for gated architectures
- Fused add-norm (residual + normalize in one pass)
- Fused element-wise operation chains across all backends
- i8×i8→i32 and FP8 quantized matmul paths
- 2:4 structured sparsity with multi-backend support
- slice_assign indexing operation
- Seeded deterministic RNG
- Expanded autograd differentiable op coverage
- CUDA caching allocator and GEMV fast paths
…kward kernel

Platform-specific floating-point edge cases in SiLU and Tanh derivative
computation could produce NaN or Inf on Windows CI, propagating non-finite
gradients through the backward pass. Guard against this by replacing any
non-finite derivative value with zero before accumulating into the gradient.
farhan-syah merged commit 671337e into main on Mar 14, 2026
11 checks passed
farhan-syah deleted the 0.5.0 branch on March 15, 2026 at 00:37


Development

Successfully merging this pull request may close these issues.

0.5.0: Fused ops, FP8 compute, 2:4 sparsity, autograd expansion, production hardening
