Merged

132 commits
4b6709f
feat: enhance tensor layout with Hash trait and comprehensive API
farhan-syah Feb 13, 2026
cbedc0a
chore: bump version to 0.5.0
farhan-syah Feb 13, 2026
0f081df
refactor: extract dtype module into separate files and add DataType t…
farhan-syah Feb 17, 2026
6692cb7
feat: add associated DType type to Runtime trait
farhan-syah Feb 17, 2026
4591abe
refactor: make Tensor generic over Runtime::DType
farhan-syah Feb 17, 2026
9839951
refactor: update operations to use Runtime::DType bounds
farhan-syah Feb 17, 2026
e767407
refactor: update algorithm implementations with Runtime::DType bounds
farhan-syah Feb 17, 2026
b262189
refactor: update autograd system with Runtime::DType bounds
farhan-syah Feb 17, 2026
0b61ea9
refactor: update tests and library exports for dtype system changes
farhan-syah Feb 17, 2026
640fbde
feat(allocator): add TrackingAllocator with stats and reset support
farhan-syah Feb 18, 2026
1fe0556
feat(tensor): add ergonomic accessors and dimension unpacking
farhan-syah Feb 18, 2026
28c18ea
feat(runtime): add Graph trait and CUDA graph capture
farhan-syah Feb 18, 2026
02ff196
refactor(runtime): tighten Runtime::DType bounds to concrete DType
farhan-syah Feb 18, 2026
18a976e
test: add ML dtype audit for reduced-precision types
farhan-syah Feb 18, 2026
3c7ce59
feat(runtime): add Communicator trait for multi-device collective com…
farhan-syah Feb 18, 2026
89a19b7
fix(runtime): recover from mutex poison in TrackingAllocator
farhan-syah Feb 18, 2026
3f945e4
fix(tests): gate ml_dtype_audit items behind f16/fp8 feature flags
farhan-syah Feb 18, 2026
649e6e3
feat(tensor): add zero-copy host slice accessors to Storage
farhan-syah Feb 18, 2026
e3b6850
feat(autograd): add backward support for narrow and cat shape ops
farhan-syah Feb 18, 2026
61ab288
fix: correct mutex poison handling and mutable slice receiver
farhan-syah Feb 18, 2026
b3a6035
feat(indexing): add slice_assign operation across all backends
farhan-syah Feb 18, 2026
ed6a0c3
feat(runtime/cuda): add NCCL-backed communicator for multi-GPU collec…
farhan-syah Feb 19, 2026
81e4f37
feat(autograd): add differentiable rms_norm and layer_norm operations
farhan-syah Feb 19, 2026
72c3041
feat(runtime): add nexar-backed inter-node communicator
farhan-syah Feb 19, 2026
16c89e4
refactor(tensor): consolidate ptr() to return offset-adjusted data po…
farhan-syah Feb 19, 2026
d8cce58
feat(ops): add log_softmax and dropout activation operations
farhan-syah Feb 19, 2026
94ba72d
fix(algorithm): tighten Runtime<DType = DType> bounds in iterative so…
farhan-syah Feb 19, 2026
d8fff34
feat(autograd): add SiLU activation with backward pass
farhan-syah Feb 19, 2026
5c31e08
feat(ops): add softplus activation with autograd support
farhan-syah Feb 19, 2026
26ca4ad
refactor(runtime): split communicator into module with hierarchical a…
farhan-syah Feb 22, 2026
67acfcd
refactor(runtime): consolidate shared utilities into common submodule
farhan-syah Feb 22, 2026
9187449
fix(reduce): treat empty dims as full reduction instead of identity
farhan-syah Feb 22, 2026
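The reduce fix above changes what an empty `dims` argument means: a full reduction over every axis rather than an identity pass-through. A minimal sketch of that normalization step (a hypothetical helper, not numr's actual API):

```rust
// Normalize a reduction's `dims` argument: an empty list now means
// "reduce over every axis" instead of a no-op. Hypothetical helper
// illustrating the semantics of the fix; not numr's real code.
fn normalize_reduce_dims(dims: &[usize], ndim: usize) -> Vec<usize> {
    if dims.is_empty() {
        (0..ndim).collect() // full reduction
    } else {
        dims.to_vec()
    }
}

fn main() {
    assert_eq!(normalize_reduce_dims(&[], 3), vec![0, 1, 2]);
    assert_eq!(normalize_reduce_dims(&[1], 3), vec![1]);
}
```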
942276f
feat(sparse_linalg): add sparse QR factorization with multi-backend s…
farhan-syah Feb 22, 2026
12f5291
fix(tensor): preserve layout offset in reshape for non-zero-offset views
farhan-syah Feb 22, 2026
279919a
feat(autograd): add backward hooks for leaf gradient notifications
farhan-syah Feb 22, 2026
866d48c
feat(autograd): add activation checkpointing
farhan-syah Feb 22, 2026
6d5a381
feat(runtime): add StreamSyncOps for compute-communication overlap
farhan-syah Feb 22, 2026
9c89036
docs(readme): document autograd, normalization, einsum, and sparse li…
farhan-syah Feb 22, 2026
739ba78
feat(autograd): add differentiable dtype cast operation
farhan-syah Feb 23, 2026
4252b02
feat(autograd): add dropout operation with inverted scaling and gradi…
farhan-syah Feb 23, 2026
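"Inverted scaling" in the dropout commit above refers to scaling surviving elements by 1/(1-p) at training time, so inference needs no rescaling at all. A deterministic sketch of the scheme (the keep-mask is passed in explicitly here; the real op draws it randomly):

```rust
// Inverted dropout: zero dropped elements and scale kept ones by
// 1/(1-p), so the expected activation magnitude matches eval time
// without any rescale there. Deterministic sketch, not numr's op.
fn dropout_inverted(x: &[f32], keep_mask: &[bool], p: f32) -> Vec<f32> {
    let scale = 1.0 / (1.0 - p);
    x.iter()
        .zip(keep_mask)
        .map(|(&v, &keep)| if keep { v * scale } else { 0.0 })
        .collect()
}

fn main() {
    let y = dropout_inverted(&[2.0, 4.0], &[true, false], 0.5);
    assert_eq!(y, vec![4.0, 0.0]); // kept value doubled, dropped one zeroed
}
```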
a900e02
feat(normalization): add group normalization across all backends
farhan-syah Feb 23, 2026
fe0d26d
feat(autograd): add differentiable conv1d with full backward pass
farhan-syah Feb 23, 2026
0bbc463
feat(autograd): add fused SwiGLU activation with autograd support
farhan-syah Feb 23, 2026
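SwiGLU, the gated activation named in the commit above, is silu(a) * b elementwise over a gate half and a value half. A scalar sketch of what the fused kernel computes (the fused version just avoids materializing silu(a) as a temporary):

```rust
// SiLU (a.k.a. swish): x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

// SwiGLU: silu(gate) * value, elementwise. Scalar reference sketch.
fn swiglu(gate: &[f32], value: &[f32]) -> Vec<f32> {
    gate.iter().zip(value).map(|(&g, &v)| silu(g) * v).collect()
}

fn main() {
    assert_eq!(swiglu(&[0.0], &[5.0]), vec![0.0]); // silu(0) = 0
    assert!((silu(1.0) - 0.731_058).abs() < 1e-4);
}
```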
a43d689
chore(deps): upgrade cudarc to 0.19 and update client construction API
farhan-syah Feb 23, 2026
19149d4
fix(sparse_qr): correct WGSL binding order and readonly buffer counts
farhan-syah Feb 23, 2026
b8e926b
fix: correct contiguous check, wgpu cat, and doctest annotation
farhan-syah Feb 23, 2026
47ab73d
feat(cpu): extend f16/bf16 SIMD dispatch to all CPU kernels
farhan-syah Feb 23, 2026
47d2549
feat(activation): add fused activation-mul ops for gated architectures
farhan-syah Feb 23, 2026
1af225b
refactor(wgpu): replace dynamic shader generation with static WGSL files
farhan-syah Feb 23, 2026
e29220e
fix(autograd,ops): apply clippy suggestions for idiomatic Rust
farhan-syah Feb 24, 2026
eb4a031
refactor(wgpu): replace runtime shader generation with static WGSL files
farhan-syah Feb 24, 2026
d918a8c
feat(activation): add fused activation-mul CUDA kernels with backward…
farhan-syah Feb 24, 2026
0fc67cc
chore(tests): remove unused imports and dead helper functions in pari…
farhan-syah Feb 24, 2026
69787a2
feat(wgpu/activation): add fused activation-mul forward and backward ops
farhan-syah Feb 24, 2026
c2bba24
feat(norm): add fused add-norm operations with forward and backward p…
farhan-syah Feb 24, 2026
be8abad
perf(softmax): switch to online 2-pass algorithm in SIMD kernels
farhan-syah Feb 24, 2026
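The "online 2-pass" softmax in the commit above folds the max and sum reductions into a single sweep by rescaling the running sum whenever the running max grows, leaving only the normalization as a second pass. A plain scalar sketch of the idea (numr's actual kernels are SIMD):

```rust
// Online two-pass softmax: pass 1 maintains a running max `m` and a
// running sum `s` of exp(x - m), rescaling `s` whenever `m` grows, so
// both reductions come out of one sweep; pass 2 normalizes.
fn softmax_online(x: &[f32]) -> Vec<f32> {
    let mut m = f32::NEG_INFINITY; // running maximum
    let mut s = 0.0f32;            // running sum of exp(x - m)
    for &v in x {
        if v > m {
            s *= (m - v).exp(); // rescale old sum to the new max
            m = v;
        }
        s += (v - m).exp();
    }
    x.iter().map(|&v| (v - m).exp() / s).collect()
}

fn main() {
    let y = softmax_online(&[1.0, 2.0, 3.0]);
    assert!((y.iter().sum::<f32>() - 1.0).abs() < 1e-5);
    assert!(y[2] > y[1] && y[1] > y[0]);
}
```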
c085121
feat(autograd): add softmax_bwd op across CPU, CUDA, and WebGPU
farhan-syah Feb 24, 2026
435e88a
feat(ops): add fused GEMM epilogue with bias and activation
farhan-syah Feb 24, 2026
01b5958
feat(dtype): implement compound assignment operators for complex types
farhan-syah Feb 24, 2026
ac61392
feat(fp8): add FP8 matrix multiplication across all backends
farhan-syah Feb 24, 2026
a8383b7
feat(sparse): add 2:4 structured sparsity with multi-backend support
farhan-syah Feb 24, 2026
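2:4 structured sparsity, added in the commit above, constrains every group of four weights to have at most two nonzeros. A dense masking sketch of the pruning rule (real sparse-tensor-core paths store the two kept values plus 2-bit position metadata instead):

```rust
// 2:4 structured sparsity: within each group of four weights, keep
// the two of largest magnitude and zero the rest. Illustrative dense
// masking pass, not a compressed-storage implementation.
fn prune_2_4(w: &mut [f32]) {
    for group in w.chunks_mut(4) {
        let mut order: Vec<usize> = (0..group.len()).collect();
        // Sort indices by descending |value|.
        order.sort_by(|&a, &b| group[b].abs().total_cmp(&group[a].abs()));
        for &i in order.iter().skip(2) {
            group[i] = 0.0; // drop the smallest-magnitude entries
        }
    }
}

fn main() {
    let mut w = [1.0, -5.0, 2.0, 0.5];
    prune_2_4(&mut w);
    assert_eq!(w, [0.0, -5.0, 2.0, 0.0]);
}
```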
6705724
feat(ops): add fused elementwise operations across all backends
farhan-syah Feb 24, 2026
b785066
fix(cuda/distance): use native accumulation type per dtype
farhan-syah Feb 24, 2026
4358b45
feat(cuda/semiring_matmul): add Bool and U8 dtype support
farhan-syah Feb 24, 2026
1190f63
refactor(cuda/normalization): apply Clippy suggestions
farhan-syah Feb 24, 2026
5c6e512
test(backend_parity): add distance, semiring_matmul, conditional, log…
farhan-syah Feb 24, 2026
f58a2ed
fix(softmax): prevent NaN when input contains -inf values
farhan-syah Feb 24, 2026
4077f44
refactor(cuda/activation): extract shared activation helpers into act…
farhan-syah Feb 25, 2026
95b99e9
feat(cuda/gemm_epilogue): implement backward pass for fused matmul-bi…
farhan-syah Feb 25, 2026
f49c3e9
feat(autograd/gemm_epilogue): add var_matmul_bias_activation with bac…
farhan-syah Feb 25, 2026
50b2717
feat(autograd/normalization): add autograd support for fused add-norm…
farhan-syah Feb 25, 2026
5175536
fix(cpu/activation): clamp GELU inner value to prevent tanh exp overflow
farhan-syah Feb 25, 2026
ddff6f7
fix(wgpu/reduce): use valid WGSL literal for i32 minimum value
farhan-syah Feb 25, 2026
d88b143
fix(wgpu/sort): make bitonic sort stable and fix i32 min literal
farhan-syah Feb 25, 2026
50b5869
feat(wgpu/matmul): support N-dimensional tensor multiplication
farhan-syah Feb 25, 2026
2d619ea
feat(cpu/simd): add i32 binary ops and SIMD dot product kernels
farhan-syah Feb 26, 2026
1246e42
perf(norm): replace two-pass reduction with Welford algorithm in laye…
farhan-syah Feb 26, 2026
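Welford's algorithm, which the commit above adopts for layer-norm variance, computes mean and variance in a single numerically stable pass instead of a mean pass followed by a squared-deviation pass. An illustrative scalar version (the SIMD kernel in the commit will differ in layout):

```rust
// Welford's online mean/variance: one pass, numerically stable.
// Returns (mean, population variance). Scalar sketch only.
fn welford(data: &[f64]) -> (f64, f64) {
    let (mut mean, mut m2) = (0.0, 0.0);
    for (i, &x) in data.iter().enumerate() {
        let delta = x - mean;
        mean += delta / (i as f64 + 1.0);
        m2 += delta * (x - mean); // second factor uses the updated mean
    }
    (mean, m2 / data.len() as f64)
}

fn main() {
    let (mean, var) = welford(&[1.0, 2.0, 3.0, 4.0]);
    assert!((mean - 2.5).abs() < 1e-12);
    assert!((var - 1.25).abs() < 1e-12);
}
```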
2defb38
feat(cpu/matmul): add i8×i8→i32 quantized matrix multiplication
farhan-syah Feb 26, 2026
ea56079
fix(cuda/tests): skip tests gracefully when CUDA is unavailable
farhan-syah Feb 26, 2026
f758624
chore: misc cleanups and doc fixes
farhan-syah Feb 26, 2026
64c0e9a
feat(cuda/gemv): add GEMV kernel for small-M matmul dispatch
farhan-syah Feb 27, 2026
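The GEMV commits above add a dedicated matrix-vector path for matmuls where one side has few rows: at small M, a tiled GEMM wastes its blocking, and a per-row dot-product kernel wins. A naive CPU reference for what that path computes:

```rust
// GEMV reference: y = A . x for an m x k row-major A. Naive version
// for comparison against the dispatched fast path; not numr's kernel.
fn gemv(a: &[f32], x: &[f32], m: usize, k: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(x.len(), k);
    (0..m)
        .map(|i| (0..k).map(|j| a[i * k + j] * x[j]).sum())
        .collect()
}

fn main() {
    // [[1, 2], [3, 4]] . [1, 1] = [3, 7]
    let y = gemv(&[1.0, 2.0, 3.0, 4.0], &[1.0, 1.0], 2, 2);
    assert_eq!(y, vec![3.0, 7.0]);
}
```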
83d81c4
feat(cuda): add pipelined D2H copy stream for concurrent GPU execution
farhan-syah Feb 27, 2026
576c2f2
perf(cuda): remove unnecessary stream syncs after broadcast kernel la…
farhan-syah Feb 27, 2026
5e4dedc
feat(cuda/gemv): add transposed-B GEMV kernels for zero-copy weight m…
farhan-syah Feb 28, 2026
aa7a2fc
perf(cpu/matmul): add GEMV-BT fast path for transposed weight matrices
farhan-syah Feb 28, 2026
177ffbe
perf(wgpu/matmul): add GEMV-BT fast path for transposed weight matrices
farhan-syah Feb 28, 2026
f323e5f
fix(tensor): treat size-1 dim strides as irrelevant in is_contiguous
farhan-syah Feb 28, 2026
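The contiguity fix above rests on the observation that a dimension of size 1 contributes nothing to addressing, so its stride can hold any value without breaking contiguity. A sketch of a check with that exemption (illustrative, not numr's actual layout code):

```rust
// Contiguity check that skips size-1 dims: their stride is never used
// to address an element, so it must not disqualify the layout.
fn is_contiguous(shape: &[usize], strides: &[usize]) -> bool {
    let mut expected = 1;
    for (&dim, &stride) in shape.iter().zip(strides).rev() {
        if dim == 1 {
            continue; // stride irrelevant when the dim has one element
        }
        if stride != expected {
            return false;
        }
        expected *= dim;
    }
    true
}

fn main() {
    // Arbitrary stride on the size-1 middle dim is fine.
    assert!(is_contiguous(&[2, 1, 3], &[3, 999, 1]));
    // Column-major 2x3 is not row-contiguous.
    assert!(!is_contiguous(&[2, 3], &[1, 2]));
}
```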
8c2555d
perf(cuda): replace stream-ordered alloc with Rust-side caching alloc…
farhan-syah Feb 28, 2026
eaf8697
feat(cuda): expose preload_modules on CudaClient for warmup
farhan-syah Mar 1, 2026
0d5b057
perf(cuda/gemv): upgrade transposed-B path to multi-row vectorized ke…
farhan-syah Mar 1, 2026
437bc2d
perf(cpu/matmul): accelerate GEMV-BT for f16/bf16 and large matrices
farhan-syah Mar 1, 2026
aef4ab0
fix(cuda): make strided-copy kernel safe for CUDA graph capture
farhan-syah Mar 2, 2026
68da293
refactor(cuda): route Runtime alloc/dealloc through caching allocator
farhan-syah Mar 2, 2026
731f124
fix(cuda): implement allocator freeze/unfreeze for graph capture
farhan-syah Mar 2, 2026
fd76de4
fix(cpu/rmsnorm): accumulate sum of squares in f64 for numerical prec…
farhan-syah Mar 3, 2026
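Accumulating the RMSNorm sum of squares in f64, as the fix above does, avoids the cancellation and rounding drift that an f32 accumulator picks up over long rows. A sketch of the accumulation pattern (the normalization step itself is omitted):

```rust
// RMS of an f32 slice with the sum of squares accumulated in f64:
// wide accumulation keeps long-row reductions accurate. Sketch of the
// precision strategy only, not numr's rms_norm kernel.
fn rms(x: &[f32], eps: f64) -> f32 {
    let ss: f64 = x.iter().map(|&v| (v as f64) * (v as f64)).sum();
    ((ss / x.len() as f64 + eps).sqrt()) as f32
}

fn main() {
    let r = rms(&[3.0, 4.0], 0.0); // sqrt((9 + 16) / 2)
    assert!((r - 12.5f32.sqrt()).abs() < 1e-6);
}
```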
bb4ea2c
refactor(special): split monolithic mod.rs into constants, helpers, a…
farhan-syah Mar 4, 2026
d44981d
fix(sparse/qr): require caller-supplied host structural data in simpl…
farhan-syah Mar 4, 2026
a439eec
refactor(cpu/simd): extract dispatch logic into dedicated dispatch mo…
farhan-syah Mar 4, 2026
54fbe27
chore(cuda): remove dead recovery helpers and add missing safety docs
farhan-syah Mar 4, 2026
5ed9e02
test(parity): add multivariate distribution tests and fix unused vari…
farhan-syah Mar 4, 2026
4ba6ead
docs(readme): document swiglu, dropout, graph capture, and distribute…
farhan-syah Mar 4, 2026
df36c24
fix(cpu/simd): use absolute crate paths in half_macros to fix dispatc…
farhan-syah Mar 4, 2026
c6db2de
perf(cpu/matmul): add AVX-512 and AVX2+FMA dot product for half-preci…
farhan-syah Mar 4, 2026
53a9a40
refactor: make CPU backend unconditional
farhan-syah Mar 4, 2026
88c3820
fix(tests): suppress unused variable warnings in parity tests
farhan-syah Mar 4, 2026
c516877
fix(tests/semiring_matmul): scope to_vec call inside CUDA block
farhan-syah Mar 4, 2026
e738f3f
feat(random): add seeded uniform random generation across all backends
farhan-syah Mar 5, 2026
78eb577
perf(cpu/matmul): split SIMD dot into target_feature functions with d…
farhan-syah Mar 6, 2026
bdd28cc
perf(cpu): add aarch64 NEON GEMV-BT kernels and fix SIMD target-featu…
farhan-syah Mar 6, 2026
59021d8
perf(cpu/norm): use dual accumulators in AVX2/AVX512 norm variance re…
farhan-syah Mar 6, 2026
40ae4a9
fix(cuda/runtime): use AUTO_FREE_ON_LAUNCH flag for graph capture
farhan-syah Mar 7, 2026
1ac75e3
feat(autograd/conv): add var_conv2d and split conv autograd by dimension
farhan-syah Mar 7, 2026
f876829
refactor(cpu/rng): replace rand/rand_distr deps with internal RNG module
farhan-syah Mar 11, 2026
d19ebdc
refactor(ops): decouple RandomOps from TensorOps and clean up re-exports
farhan-syah Mar 11, 2026
bfc0fb9
fix(ops/matmul): support broadcasting in batched matmul across all ba…
farhan-syah Mar 13, 2026
dbec954
docs(cuda/sparse): add Safety sections to unsafe kernel launcher func…
farhan-syah Mar 13, 2026
9d8ec7e
refactor(cpu/kernels): scope DType import to cfg-gated SIMD blocks
farhan-syah Mar 14, 2026
e887fe7
refactor(autograd/reduce): split reduce.rs into per-operation modules
farhan-syah Mar 14, 2026
0dbad06
refactor(cuda/kernels): split index and sparse_merge launchers into m…
farhan-syah Mar 14, 2026
32e5bd0
feat(cuda/fp8): add FP8 kernel support across CUDA compute paths
farhan-syah Mar 14, 2026
f5a3af3
feat(ops/fp8): extend op dispatch to FP8 dtypes
farhan-syah Mar 14, 2026
7d569f1
fix(tests): adjust FP8E4M3 tolerance and suppress unused variable war…
farhan-syah Mar 14, 2026
60d971c
feat(cpu/simd): add AVX2 math kernels for transcendental and special …
farhan-syah Mar 14, 2026
268b63f
chore(deps): relax patch version pins to minor version constraints
farhan-syah Mar 14, 2026
e1e4ad4
fix(ops/cpu/distance): gate TypeConversionOps import behind fp8 feature
farhan-syah Mar 14, 2026
2b62cf5
ci: remove redundant --features cpu from no-default-features checks
farhan-syah Mar 14, 2026
4479423
fix(cpu/simd): resolve aarch64 NEON compilation warnings and correctn…
farhan-syah Mar 14, 2026
9568b3e
refactor(cpu/gemm): accumulate backward pass in precision-appropriate…
farhan-syah Mar 14, 2026
ba115bf
fix(test/conditional): use correct variable in WebGPU where_cond pari…
farhan-syah Mar 14, 2026
586451d
chore(ci): upgrade GitHub Actions to v5
farhan-syah Mar 14, 2026
08df4cb
docs(readme): document 0.5.0 feature additions
farhan-syah Mar 14, 2026
806596c
fix(cpu/gemm): clamp non-finite activation derivatives to zero in bac…
farhan-syah Mar 14, 2026
4 changes: 2 additions & 2 deletions .github/workflows/baseline.yml
@@ -34,7 +34,7 @@ jobs:
name: Save Benchmark Baseline
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v5

- name: Install Rust
uses: dtolnay/rust-toolchain@stable
@@ -49,7 +49,7 @@ jobs:
# Cache keyed by SHA so each merge gets its own entry.
# benchmark.yml uses restore-keys prefix matching to find the latest one.
- name: Cache baseline
uses: actions/cache/save@v4
uses: actions/cache/save@v5
with:
path: target/fluxbench/baseline.json
key: numr-bench-baseline-${{ github.sha }}
4 changes: 2 additions & 2 deletions .github/workflows/benchmark.yml
@@ -42,7 +42,7 @@ jobs:
name: Regression Check
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v5
with:
fetch-depth: 0

@@ -61,7 +61,7 @@ jobs:
# picks the latest cache entry starting with "numr-bench-baseline-".
- name: Restore baseline from main
id: baseline-cache
uses: actions/cache/restore@v4
uses: actions/cache/restore@v5
with:
path: target/fluxbench/baseline.json
key: numr-bench-baseline-dummy
4 changes: 2 additions & 2 deletions .github/workflows/release.yml
@@ -23,7 +23,7 @@ jobs:
outputs:
version: ${{ steps.version.outputs.version }}
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v5

- name: Install Rust
uses: dtolnay/rust-toolchain@stable
@@ -71,7 +71,7 @@ jobs:
runs-on: ubuntu-latest
environment: crates-io
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v5

- name: Install Rust
uses: dtolnay/rust-toolchain@stable
10 changes: 5 additions & 5 deletions .github/workflows/test.yml
@@ -24,7 +24,7 @@ jobs:
name: Lint, Format & Docs
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v5

- name: Install Rust
uses: dtolnay/rust-toolchain@stable
@@ -56,7 +56,7 @@ jobs:
os: [ubuntu-latest, macos-latest, windows-latest]

steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v5

- name: Install Rust
uses: dtolnay/rust-toolchain@stable
@@ -75,7 +75,7 @@ jobs:
name: Backend Compile, Parity & Examples
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v5

- name: Install Rust
uses: dtolnay/rust-toolchain@stable
@@ -86,7 +86,7 @@

# Backend compile gates
- name: "Compile: cpu-only (no default features)"
run: cargo check --no-default-features --features cpu
run: cargo check --no-default-features

- name: "Compile: cpu + f16 + sparse"
run: cargo check --features f16,sparse
@@ -95,7 +95,7 @@
run: cargo check --features wgpu,f16,sparse

- name: "Compile tests: cpu-only"
run: cargo test --no-run --no-default-features --features cpu
run: cargo test --no-run --no-default-features

- name: "Compile tests: wgpu"
run: cargo test --no-run --features wgpu,f16,sparse
33 changes: 20 additions & 13 deletions Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "numr"
version = "0.4.0"
version = "0.5.0"
edition = "2024"
rust-version = "1.89"
description = "High-performance numerical computing with multi-backend GPU acceleration (CPU/CUDA/WebGPU)"
@@ -15,14 +15,20 @@ features = ["f16", "sparse"]
# cuda and wgpu require hardware SDKs not available on docs.rs

[features]
default = ["cpu", "rayon"]
cpu = []
default = ["rayon"]
cuda = ["dep:cudarc"]
nccl = ["cuda", "cudarc?/nccl"]
distributed = ["dep:nexar", "dep:tokio"]
distributed-gpu = ["distributed", "nccl", "dep:nexar-nccl"]
wgpu = ["dep:wgpu", "dep:pollster"]
rayon = ["dep:rayon"]
f16 = ["dep:half", "cudarc?/f16"] # Half-precision floats (F16, BF16) - optional reduced-precision support
fp8 = [] # 8-bit floats (FP8E4M3, FP8E5M2) - optional ultra-low-precision support
sparse = [] # Sparse tensor formats (CSR, CSC, COO) and operations
f16 = [
"dep:half",
"cudarc?/f16",
] # Half-precision floats (F16, BF16) - optional reduced-precision support
fp8 = [
] # 8-bit floats (FP8E4M3, FP8E5M2) - optional ultra-low-precision support
sparse = [] # Sparse tensor formats (CSR, CSC, COO) and operations

[dependencies]
# Core
@@ -35,11 +41,7 @@ parking_lot = "0.12"
# Optional: Parallelism
rayon = { version = "1.11", optional = true }

# Random number generation (required for rand/randn operations)
rand = "0.9"
rand_distr = "0.5"

# Zero-copy serialization for embedded data
# Zero-copy serialization for embedded data (used by sobol_data)
rkyv = "0.8"

# Optional: Half-precision floats
@@ -48,15 +50,20 @@
"num-traits",
] }

# Optional: Inter-node distributed communication
nexar = { version = "0.1", optional = true }
nexar-nccl = { version = "0.1", optional = true }
tokio = { version = "1", features = ["rt"], optional = true }

# Optional: CUDA backend
cudarc = { version = "0.18", optional = true, features = [
cudarc = { version = "0.19", optional = true, features = [
"cuda-version-from-build-system",
] }

# Optional: WebGPU backend
wgpu = { version = "28.0", optional = true }
pollster = { version = "0.4", optional = true }
paste = "1.0.15"
paste = "1.0"

[dev-dependencies]
approx = "0.5"
109 changes: 95 additions & 14 deletions README.md
@@ -90,7 +90,7 @@ numr implements a comprehensive set of tensor operations across CPU, CUDA, and W
### Shape and Data Movement

- **ShapeOps**: cat, stack, split, chunk, repeat, pad, roll
- **IndexingOps**: gather, scatter, gather_nd, scatter_reduce, index_select, masked_select, masked_fill, embedding_lookup, bincount, argmax, argmin
- **IndexingOps**: gather, scatter, gather_nd, scatter_reduce, index_select, masked_select, masked_fill, embedding_lookup, bincount, argmax, argmin, slice_assign
- **SortingOps**: sort, argsort, topk, unique, nonzero, searchsorted

### Reductions
@@ -106,22 +106,34 @@

### Activation & Normalization Functions

- **ActivationOps**: relu, sigmoid, silu, gelu, leaky_relu, elu, softmax
- **NormalizationOps**: rms_norm, layer_norm
- **ActivationOps**: relu, sigmoid, silu, gelu, swiglu, leaky_relu, elu, softmax, dropout, fused activation-mul (for gated architectures)
- **NormalizationOps**: rms_norm, layer_norm, batch_norm, group_norm, instance_norm, fused add-norm (residual + normalize in one pass)
- **GemmEpilogueOps**: fused matmul+bias+activation in a single kernel (forward + backward)
- **FusedElementwiseOps**: fused element-wise operation chains across all backends
- **ConvOps**: conv1d, conv2d, depthwise_conv2d (with stride, padding, dilation, groups)
- **EinsumOps**: Einstein summation notation

_These are mathematical functions commonly used in ML, but numr itself is not an ML framework._

### Linear Algebra

- **MatmulOps**: matmul, matmul_bias (fused GEMM+bias)
- **MatmulOps**: matmul, matmul_bias (fused GEMM+bias), i8×i8→i32 quantized matmul, FP8 matmul
- **LinalgOps**: solve, lstsq, pinverse, inverse, det, trace, matrix_rank, diag, matrix_norm, kron, khatri_rao
- **ComplexOps**: conj, real, imag, angle (for complex tensor support)

### Automatic Differentiation

- **Reverse-mode**: `Var<R>` tracked tensors, `backward()` for gradient computation
- **Forward-mode**: `jvp()`, `jacobian_forward()` via dual numbers
- **Second-order**: `hvp()` for Hessian-vector products, `backward_with_graph()` for higher-order gradients
- **Activation checkpointing**: `checkpoint()` to trade compute for memory
- **Backward hooks**: `BackwardHook` trait for gradient notifications (e.g., distributed allreduce)
- **Differentiable ops**: matmul, conv1d, conv2d, softmax, rms_norm, layer_norm, SiLU, softplus, SwiGLU, dropout, fused GEMM epilogue, fused add-norm, dtype cast, narrow, cat

### Statistics and Probability

- **StatisticalOps**: var, std, skew, kurtosis, quantile, percentile, median, cov, corrcoef
- **RandomOps**: rand, randn, randint, multinomial, bernoulli, poisson, binomial, beta, gamma, exponential, chi_squared, student_t, f_distribution
- **RandomOps**: rand, randn, randint, multinomial, bernoulli, poisson, binomial, beta, gamma, exponential, chi_squared, student_t, f_distribution (with seeded deterministic generation)
- **MultivariateRandomOps**: multivariate_normal, wishart, dirichlet
- **QuasirandomOps**: Sobol, Halton sequences

@@ -165,10 +177,38 @@ _These are mathematical functions commonly used in ML, but numr itself is not an

- polyroots, polyval, polyfromroots, polymul

**Iterative Solvers (`numr::iterative`):**

- **Linear solvers**: CG, MINRES, BiCGSTAB, GMRES, LGMRES, CGS, QMR, Jacobi, SOR, Adaptive GMRES
- **Eigensolvers**: Lanczos (symmetric), Arnoldi/IRAM (non-symmetric)
- **Sparse SVD**: via Lanczos bidiagonalization
- **Preconditioners**: ILU(0), IC(0), Algebraic Multigrid (AMG) with V-cycles

**Sparse Tensors (`numr::sparse`, feature-gated):**

- Formats: CSR, CSC, COO
- Operations: SpGEMM (sparse matrix multiplication), SpMV (sparse matrix-vector), DSMM (dense-sparse matrix)
- 2:4 structured sparsity with multi-backend support

**Sparse Linear Algebra (`numr::sparse_linalg`):**

- **Direct solvers**: Sparse LU (Gilbert-Peierls), sparse QR
- **Incomplete factorizations**: ILU(0), ILU(k), IC(0)
- **Preprocessing**: COLAMD ordering, maximum transversal
- **Symbolic/numeric split**: Reuse sparsity structure for repeated solves

**Graph Capture (`numr::runtime`):**

- **`Graph` trait**: Capture a sequence of operations and replay them with zero re-launch overhead
- **CUDA Graphs**: Full capture support—fixed-address buffer replay for inference loops and training steps
- **CPU / WebGPU**: Transparent no-op path; callers write backend-agnostic code using `R::supports_graph_capture()`

**Distributed Computing (`numr::communicator`, feature `nccl`):**

- **`CommunicatorGroup`**: Single-node multi-GPU all-reduce, broadcast, and allgather via NCCL
- **`HierarchicalCommunicator`**: Two-level collective—NCCL intra-node, nexar inter-node
- **`NexarNetCommunicator`**: Pure-Rust distributed transport (QUIC via nexar) for multi-machine tensor parallelism
- **`BackwardHook`**: Autograd hook interface—trigger cross-node gradient synchronization during `backward()`

## Dtypes

@@ -198,15 +238,15 @@

All backends implement identical algorithms with native kernels—no cuBLAS, MKL, or vendor library dependencies.

| Hardware | Backend | Feature | Status | Notes |
| ------------ | ------- | ------------- | ------- | ------------------ |
| CPU (x86-64) | CPU | cpu (default) | ✓ | AVX-512/AVX2 SIMD |
| CPU (ARM64) | CPU | cpu | ✓ | NEON SIMD |
| NVIDIA GPU | CUDA | cuda | ✓ | Native PTX kernels |
| AMD GPU | WebGPU | wgpu | ✓ | WGSL shaders |
| Intel GPU | WebGPU | wgpu | ✓ | WGSL shaders |
| Apple GPU | WebGPU | wgpu | ✓ | WGSL shaders |
| AMD GPU | ROCm | - | Planned | Native HIP kernels |
| Hardware | Backend | Feature | Status | Notes |
| ------------ | ------- | ------------- | ------- | ------------------------------------------------------ |
| CPU (x86-64) | CPU | cpu (default) | ✓ | AVX-512/AVX2 SIMD |
| CPU (ARM64) | CPU | cpu | ✓ | NEON SIMD |
| NVIDIA GPU | CUDA | cuda | ✓ | Native PTX kernels, caching allocator, GEMV fast paths |
| AMD GPU | WebGPU | wgpu | ✓ | WGSL shaders |
| Intel GPU | WebGPU | wgpu | ✓ | WGSL shaders |
| Apple GPU | WebGPU | wgpu | ✓ | WGSL shaders |
| AMD GPU | ROCm | - | Planned | Native HIP kernels |

### SIMD Acceleration

@@ -443,6 +483,45 @@ fn main() -> Result<()> {
}
```

### Automatic Differentiation

```rust
use numr::prelude::*;
use numr::autograd::*;

fn main() -> Result<()> {
let client = CpuRuntime::client()?;

// Create tracked variables
let x = Var::new(Tensor::<CpuRuntime>::from_slice(&[2.0, 3.0], &[2])?, true);
let w = Var::new(Tensor::<CpuRuntime>::from_slice(&[0.5, -1.0], &[2])?, true);

// Forward pass (builds computation graph)
let y = var_mul(&x, &w, &client)?;
let loss = var_sum(&y, &client)?;

// Backward pass
let grads = backward(&loss, &client)?;
let dx = grads.get(x.tensor()); // gradients w.r.t. x
let dw = grads.get(w.tensor()); // gradients w.r.t. w

// Activation checkpointing (trade compute for memory)
let checkpointed = checkpoint(|inputs| {
let h = var_relu(&inputs[0], &client)?;
var_matmul(&h, &inputs[1], &client)
}, &[&x, &w])?;

// Forward-mode AD (Jacobian-vector products)
let tangent = Tensor::<CpuRuntime>::ones(&[2], &client)?;
let jvp_result = jvp(|x| client.mul(x, x), &x.tensor(), &tangent, &client)?;

// Hessian-vector product
let hvp_result = hvp(|x, c| c.mul(x, x), &x.tensor(), &tangent, &client)?;

Ok(())
}
```

## Installation

### CPU-only (default)
@@ -484,7 +563,9 @@ numr = { version = "*", features = [
| `wgpu` | Cross-platform GPU (WebGPU) | ✗ |
| `rayon` | Multi-threaded CPU via Rayon | ✓ |
| `f16` | Half-precision floats (F16, BF16) | ✗ |
| `fp8` | FP8 precision (E4M3, E5M2) | ✗ |
| `sparse` | Sparse tensor support (CSR, CSC, COO) | ✗ |
| `nccl` | Multi-GPU communication via NCCL | ✗ |

## Building from Source
