perf: vectorize CPU bottlenecks with vDSP and cblas #32

Open

alvgeppetto-debug wants to merge 1 commit into maderix:main from alvgeppetto-debug:perf/cpu-vectorization

Conversation

@alvgeppetto-debug

Vectorize CPU bottlenecks in the training loop using Accelerate framework.

Changes:

  • Adam optimizer vectorized with vDSP batch ops (vDSP_vsmul, vsma, vsq, vdiv, vvsqrtf, vsadd) in both backward.h and stories_cpu_ops.h
  • dW accumulation replaced with cblas_sgemm (CblasRowMajor, CblasTrans, CblasNoTrans) in backward.h
  • dx backward pass replaced with cblas_sgemm in backward.h
  • Added -framework Accelerate to train target in Makefile (was only on train_large)

Both make train and make train_large compile cleanly on macOS.

- Vectorize adam_update with vDSP batch ops (stories_cpu_ops.h)
  Replaces scalar per-element loop with vDSP_vsmul/vsma/vsq/vdiv
  Expected ~3-4x faster for 2.4M parameter updates

- Vectorize model_adam_step ADAM_UPDATE macro with vDSP (backward.h)
  Same batch ops pattern for the train.m model pipeline

- Replace cpu_accum_dW with cblas_sgemm (backward.h)
  dW += dy^T @ x is a standard BLAS GEMM operation
  Expected 5-10x faster for weight gradient accumulation

- Replace cpu_matmul_backward_dx with cblas_sgemm (backward.h)
  dx = dy @ W^T is also a standard BLAS GEMM

- Add -framework Accelerate to train target (Makefile)
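The dW accumulation above is a plain GEMM once you fix the index convention. The sketch below is a scalar reference, not the PR's code, with the equivalent row-major `cblas_sgemm` call in a comment; the shape names (`B`, `OUT`, `IN`) are illustrative assumptions.

```c
/* Scalar reference for dW += dy^T @ x, with row-major
   x:[B,IN], dy:[B,OUT], dW:[OUT,IN]. The equivalent BLAS call:
     cblas_sgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
                 OUT, IN, B, 1.0f, dy, OUT, x, IN, 1.0f, dW, IN);
   (dx = dy @ W^T follows the same pattern with NoTrans/Trans.) */
static void accum_dW_ref(float *dW, const float *dy, const float *x,
                         int B, int OUT, int IN) {
    for (int o = 0; o < OUT; o++)
        for (int i = 0; i < IN; i++)
            for (int b = 0; b < B; b++)
                dW[o * IN + i] += dy[b * OUT + o] * x[b * IN + i];
}
```

Note the `beta = 1.0f` argument in the sgemm comment: it makes BLAS accumulate into dW rather than overwrite it, matching the `+=` semantics.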

alvgeppetto-debug commented Mar 3, 2026

Benchmark Results

Hardware: Apple M3 Ultra (28-core CPU, 96 GB RAM, 31.6 TFLOPS peak ANE dual-die)
Config: train_large --steps 30, synthetic 100K tokens

| Metric | main (baseline) | cpu-vectorization | Delta |
|---|---|---|---|
| Avg train | 101.5 ms/step | 96.7 ms/step | -4.7% |
| Peak batch | 99.0 ms/step | 87.5 ms/step | -11.6% |
| ANE TFLOPS | 0.92 | 0.96 | +4.3% |
| Peak ANE TFLOPS | 0.94 | 1.06 | +12.8% |
| Total TFLOPS | 1.72 | 1.80 | +4.7% |
| elem (Adam+embed) | 15.0 ms/step | 13.1 ms/step | -12.7% |

Findings

  • vDSP Adam vectorization cuts elem time by ~13%
  • Peak throughput reaches 87.5 ms/step and 1.06 ANE TFLOPS on warmest batch
  • Wall-clock time is unchanged at 30 steps (compilation dominates at ~77% of the run); training time itself drops 4.8% (2900 ms vs 3046 ms)
  • Benefit compounds at higher step counts where compile overhead is amortized

dev-erik added a commit to dev-erik/ANE that referenced this pull request Mar 4, 2026
