perf: vectorize CPU bottlenecks with vDSP and cblas #32

Open

alvgeppetto-debug wants to merge 1 commit into maderix:main from alvgeppetto-debug:perf/cpu-vectorization

Conversation

@alvgeppetto-debug

Vectorize CPU bottlenecks in the training loop using Accelerate framework.

Changes:

  • Adam optimizer vectorized with vDSP batch ops (vDSP_vsmul, vsma, vsq, vdiv, vvsqrtf, vsadd) in both backward.h and stories_cpu_ops.h
  • dW accumulation replaced with cblas_sgemm (CblasRowMajor, CblasTrans, CblasNoTrans) in backward.h
  • dx backward pass replaced with cblas_sgemm in backward.h
  • Added -framework Accelerate to train target in Makefile (was only on train_large)

Both make train and make train_large compile cleanly on macOS.

- Vectorize adam_update with vDSP batch ops (stories_cpu_ops.h)
  Replaces scalar per-element loop with vDSP_vsmul/vsma/vsq/vdiv
  Expected ~3-4x faster for 2.4M parameter updates

- Vectorize model_adam_step ADAM_UPDATE macro with vDSP (backward.h)
  Same batch ops pattern for the train.m model pipeline

- Replace cpu_accum_dW with cblas_sgemm (backward.h)
  dW += dy^T @ x is a standard BLAS GEMM operation
  Expected 5-10x faster for weight gradient accumulation

- Replace cpu_matmul_backward_dx with cblas_sgemm (backward.h)
  dx = dy @ W^T is also a standard BLAS GEMM

- Add -framework Accelerate to train target (Makefile)
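The dW accumulation above is a plain GEMM once you fix the index convention. The sketch below is a scalar reference, not the PR's code, with the equivalent row-major `cblas_sgemm` call in a comment; the shape names (`B`, `OUT`, `IN`) are illustrative assumptions.

```c
/* Scalar reference for dW += dy^T @ x, with row-major
   x:[B,IN], dy:[B,OUT], dW:[OUT,IN]. The equivalent BLAS call:
     cblas_sgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
                 OUT, IN, B, 1.0f, dy, OUT, x, IN, 1.0f, dW, IN);
   (dx = dy @ W^T follows the same pattern with NoTrans/Trans.) */
static void accum_dW_ref(float *dW, const float *dy, const float *x,
                         int B, int OUT, int IN) {
    for (int o = 0; o < OUT; o++)
        for (int i = 0; i < IN; i++)
            for (int b = 0; b < B; b++)
                dW[o * IN + i] += dy[b * OUT + o] * x[b * IN + i];
}
```

Note the `beta = 1.0f` argument in the sgemm comment: it makes BLAS accumulate into dW rather than overwrite it, matching the `+=` semantics.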

alvgeppetto-debug commented Mar 3, 2026

Benchmark Results

Hardware: Apple M3 Ultra (28-core CPU, 96 GB RAM, 31.6 TFLOPS peak ANE dual-die)
Config: train_large --steps 30, synthetic 100K tokens

| Metric | main (baseline) | cpu-vectorization | Delta |
|---|---|---|---|
| Avg train | 101.5 ms/step | 96.7 ms/step | -4.7% |
| Peak batch | 99.0 ms/step | 87.5 ms/step | -11.6% |
| ANE TFLOPS | 0.92 | 0.96 | +4.3% |
| Peak ANE TFLOPS | 0.94 | 1.06 | +12.8% |
| Total TFLOPS | 1.72 | 1.80 | +4.7% |
| elem (Adam+embed) | 15.0 ms/step | 13.1 ms/step | -12.7% |

Findings

  • vDSP Adam vectorization cuts elem time by ~13%
  • Peak throughput reaches 87.5 ms/step and 1.06 ANE TFLOPS on warmest batch
  • Wall-clock time is unchanged at 30 steps (compilation dominates at ~77% of the run); training time itself drops 4.8% (2900 ms vs 3046 ms)
  • Benefit compounds at higher step counts where compile overhead is amortized

dev-erik added a commit to dev-erik/ANE that referenced this pull request Mar 4, 2026
