Skip to content

perf: reduce compile and IO overhead #33

Open
alvgeppetto-debug wants to merge 1 commit into maderix:main from alvgeppetto-debug:perf/compile-io

Reduce compile and IO overhead in the training loop.

Changes:

  • ACCUM_STEPS configurable via ANE_ACCUM_STEPS env var (default 10). Higher values mean fewer exec() restarts and better effective throughput.
  • MAX_COMPILES configurable via ANE_MAX_COMPILES env var (default 100). Allows tuning for different hardware/OS versions.
  • IOSurface pooling: reuse freed surfaces of matching size instead of creating new ones. Avoids repeated IOSurfaceCreate/CFRelease calls on every recompile cycle. Pool capacity: 128 surfaces; lookup is an O(n) scan, with swap-remove on a hit.
  • Applied to both tiny_train.m and train_large.m codepaths.
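
The pooling bullet above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names SurfacePool, PoolEntry, pool_take, and pool_put are hypothetical, and a plain void * stands in for an IOSurfaceRef so the sketch compiles without the IOSurface framework. Only the capacity of 128, matching by size, and the swap-remove step come from the description.

```c
#include <stddef.h>

#define POOL_CAPACITY 128  /* capacity stated in the PR description */

/* Hypothetical pool entry: `handle` stands in for an IOSurfaceRef. */
typedef struct {
    void  *handle;
    size_t bytes;
} PoolEntry;

typedef struct {
    PoolEntry entries[POOL_CAPACITY];
    size_t    count;
} SurfacePool;

/* Return a pooled handle of exactly `bytes`, or NULL on a miss.
 * Lookup is a linear scan; on a hit the last entry is moved into
 * the gap (swap-remove), so removal itself is constant time. */
static void *pool_take(SurfacePool *pool, size_t bytes) {
    for (size_t i = 0; i < pool->count; i++) {
        if (pool->entries[i].bytes == bytes) {
            void *handle = pool->entries[i].handle;
            pool->entries[i] = pool->entries[--pool->count];
            return handle;
        }
    }
    return NULL;
}

/* Offer a freed handle back to the pool. Returns 1 if stored, 0 if the
 * pool is full, in which case the caller releases the surface itself. */
static int pool_put(SurfacePool *pool, void *handle, size_t bytes) {
    if (pool->count >= POOL_CAPACITY) return 0;
    pool->entries[pool->count].handle = handle;
    pool->entries[pool->count].bytes  = bytes;
    pool->count++;
    return 1;
}
```

In a real allocator, a NULL from pool_take would fall through to IOSurfaceCreate, and a 0 from pool_put would fall through to CFRelease. Swap-remove does not preserve entry order, which is acceptable here because any surface of the right size is interchangeable.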

Both make train and make train_large compile cleanly on macOS.


alvgeppetto-debug commented Mar 3, 2026

Benchmark Results

Hardware: Apple M3 Ultra (28-core CPU, 96 GB RAM, 31.6 TFLOPS peak ANE dual-die)
Config: train_large --steps 30, synthetic 100K tokens

Default (ACCUM_STEPS=10)

| Metric    | main (baseline) | compile-io    | Delta |
|-----------|-----------------|---------------|-------|
| Avg train | 101.5 ms/step   | 100.7 ms/step | -0.8% |
| Wall time | 16.2 s          | 15.9 s        | -1.9% |

With ANE_ACCUM_STEPS=20

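| Metric           | main (baseline) | compile-io (accum=20) | Delta   |
|------------------|-----------------|-----------------------|---------|
| Avg train        | 101.5 ms/step   | 98.5 ms/step          | -3.0%   |
| Wall time        | 16.2 s          | 11.7 s                | -27.8%  |
| Compile overhead | 77.6%           | 71.7%                 | -5.9 pp |
| exec() restarts  | 3               | 2                     | -33%    |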
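
For anyone re-running the benchmark, the Delta column mixes two conventions: relative change in percent for timings and counts, and a plain difference in percentage points (pp) for compile overhead. A small sketch of both, with hypothetical helper names that are not part of this PR:

```c
/* Relative change in percent; negative means the candidate is lower
 * (faster) than the baseline. */
static double relative_change_pct(double baseline, double candidate) {
    return (candidate - baseline) / baseline * 100.0;
}

/* Difference between two percentages, in percentage points. */
static double percentage_point_delta(double baseline_pct, double candidate_pct) {
    return candidate_pct - baseline_pct;
}
```

With the wall times above, relative_change_pct(16.2, 11.7) is about -27.8, and percentage_point_delta(77.6, 71.7) is about -5.9, matching the reported values.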

Findings

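  • Configurable ACCUM_STEPS is the main value: doubling it from 10 to 20 cuts wall-clock time by ~28% (16.2 s to 11.7 s) by reducing exec() restarts from 3 to 2
  • IOSurface pooling provides only a marginal benefit at this scale (-0.8% per step and -1.9% wall time at the default ACCUM_STEPS=10); a larger benefit is expected on longer training runs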
  • Env var tuning (ANE_ACCUM_STEPS, ANE_MAX_COMPILES) enables runtime adjustment without recompilation
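
One way such runtime tuning could be read is sketched below. This is an assumption about the shape of the code, not the PR's actual implementation: env_int_or_default is a hypothetical helper, and only the variable names ANE_ACCUM_STEPS and ANE_MAX_COMPILES and their defaults (10 and 100) come from the description above.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: read a positive integer from an environment
 * variable, returning `fallback` when the variable is unset, empty,
 * malformed, non-positive, or out of int range. */
static int env_int_or_default(const char *name, int fallback) {
    const char *raw = getenv(name);
    if (raw == NULL || *raw == '\0') return fallback;

    char *end = NULL;
    errno = 0;
    long value = strtol(raw, &end, 10);
    if (errno != 0 || end == raw || *end != '\0') return fallback;
    if (value <= 0 || value > INT_MAX) return fallback;
    return (int)value;
}
```

At startup this would be called as env_int_or_default("ANE_ACCUM_STEPS", 10) and env_int_or_default("ANE_MAX_COMPILES", 100). Falling back on malformed input, rather than aborting, keeps a typo in the environment from changing behavior silently in the other direction (for example, atoi would return 0 for "abc").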

dev-erik added a commit to dev-erik/ANE that referenced this pull request Mar 4, 2026