perf: reduce compile and IO overhead by alvgeppetto-debug · Pull Request #33 · maderix/ANE

alvgeppetto-debug · 2026-03-03T19:58:54Z

Reduce compile and IO overhead in the training loop.

Changes:

ACCUM_STEPS configurable via ANE_ACCUM_STEPS env var (default 10). Higher values mean fewer exec() restarts and better effective throughput.
MAX_COMPILES configurable via ANE_MAX_COMPILES env var (default 100). Allows tuning for different hardware/OS versions.
IOSurface pooling: reuse freed surfaces by size instead of creating new ones. Avoids repeated IOSurfaceCreate/CFRelease on every recompile cycle. Pool capacity: 128 surfaces; lookup is a linear O(n) scan, and a hit is taken out of the pool with swap-remove.
Applied to both tiny_train.m and train_large.m codepaths.
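
For readers who want a concrete picture of the two environment overrides listed above, here is a minimal sketch in plain C. It is not the code from this PR: the helper names `parse_positive_int` and `env_positive_int` are hypothetical, and the choice to fall back to the default on malformed or non-positive input is an assumption. Only the variable names and the defaults (10 and 100) come from the description.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper (not from the PR): parse a strictly positive int,
 * returning `fallback` for NULL, empty, malformed, or out-of-range input. */
static int parse_positive_int(const char *s, int fallback) {
    if (s == NULL || *s == '\0') return fallback;

    errno = 0;
    char *end = NULL;
    long v = strtol(s, &end, 10);

    /* Reject trailing junk ("12x"), overflow, and values <= 0. */
    if (errno != 0 || *end != '\0' || v <= 0 || v > INT_MAX) return fallback;
    return (int)v;
}

/* Hypothetical helper: same as above, but reads the named env var. */
static int env_positive_int(const char *name, int fallback) {
    return parse_positive_int(getenv(name), fallback);
}

/* Variable names and defaults as stated in the PR description. */
static int accum_steps(void)  { return env_positive_int("ANE_ACCUM_STEPS", 10); }
static int max_compiles(void) { return env_positive_int("ANE_MAX_COMPILES", 100); }
```

With something like this in place, a run such as `ANE_ACCUM_STEPS=20 ./train_large --steps 30` would pick up the override at startup, while an unset or invalid value keeps the default.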

Both make train and make train_large compile cleanly on macOS.

- Make ACCUM_STEPS configurable via ANE_ACCUM_STEPS env var (default 10)
  Higher values = fewer exec() restarts, better effective throughput
- Make MAX_COMPILES configurable via ANE_MAX_COMPILES env var (default 100)
  Allows tuning for different hardware/OS versions
- IOSurface pooling: reuse freed surfaces by size instead of creating new
  Avoids repeated IOSurfaceCreate/CFRelease on every recompile cycle
  Pool capacity: 128 surfaces with swap-remove for O(n) lookup
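
The pooling idea in the commit message can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `Surface` is a malloc-backed stand-in for an `IOSurfaceRef` so the example builds on any platform, and all function names are hypothetical. In the real code path, `surface_create` and `surface_destroy` would correspond to `IOSurfaceCreate` and `CFRelease`. The capacity of 128, the match-by-size rule, and the linear scan with swap-remove are taken from the description.

```c
#include <stddef.h>
#include <stdlib.h>

#define POOL_CAP 128  /* pool capacity stated in the PR */

/* Stand-in for an IOSurfaceRef (assumption, for portability). */
typedef struct { size_t bytes; void *mem; } Surface;

typedef struct {
    Surface *items[POOL_CAP];  /* freed surfaces awaiting reuse */
    int count;
} SurfacePool;

/* Stand-in for IOSurfaceCreate. */
static Surface *surface_create(size_t bytes) {
    Surface *s = malloc(sizeof *s);
    if (!s) return NULL;
    s->bytes = bytes;
    s->mem = calloc(1, bytes ? bytes : 1);
    if (!s->mem) { free(s); return NULL; }
    return s;
}

/* Stand-in for CFRelease. */
static void surface_destroy(Surface *s) {
    if (s) { free(s->mem); free(s); }
}

/* Linear O(n) scan for a pooled surface of exactly `bytes`.
 * On a hit, the last entry is moved into the vacated slot (swap-remove),
 * so removal is O(1) and the array stays dense. On a miss, create one. */
static Surface *pool_acquire(SurfacePool *p, size_t bytes) {
    for (int i = 0; i < p->count; i++) {
        if (p->items[i]->bytes == bytes) {
            Surface *hit = p->items[i];
            p->items[i] = p->items[--p->count];
            return hit;
        }
    }
    return surface_create(bytes);
}

/* Return a surface to the pool; destroy it if the pool is full. */
static void pool_release(SurfacePool *p, Surface *s) {
    if (!s) return;
    if (p->count < POOL_CAP) p->items[p->count++] = s;
    else surface_destroy(s);
}

/* Destroy everything still held by the pool. */
static void pool_drain(SurfacePool *p) {
    while (p->count > 0) surface_destroy(p->items[--p->count]);
}
```

A caller would replace each create with `pool_acquire(&pool, size)` and each release with `pool_release(&pool, surface)`. With at most 128 entries, the linear scan is cheap next to the cost of allocating a fresh surface.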

alvgeppetto-debug · 2026-03-03T20:03:41Z

Benchmark Results

Hardware: Apple M3 Ultra (28-core CPU, 96 GB RAM, 31.6 TFLOPS peak ANE dual-die)
Config: train_large --steps 30, synthetic 100K tokens

Default (ACCUM_STEPS=10)

Metric	main (baseline)	compile-io	Delta
Avg train	101.5 ms/step	100.7 ms/step	-0.8%
Wall time	16.2s	15.9s	-1.9%

With ANE_ACCUM_STEPS=20

Metric	main (baseline)	compile-io (accum=20)	Delta
Avg train	101.5 ms/step	98.5 ms/step	-3.0%
Wall time	16.2s	11.7s	-27.8%
Compile overhead	77.6%	71.7%	-5.9pp
exec() restarts	3	2	-33%
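
The Delta columns above can be reproduced with a few lines of arithmetic. The sketch below assumes the convention the tables appear to use: relative change against the main baseline for times and counts, and a plain difference in percentage points (pp) for compile overhead. The function names are illustrative only.

```c
/* Relative change from baseline to candidate, in percent.
 * Negative means the candidate is lower (faster). */
static double pct_delta(double baseline, double candidate) {
    return (candidate - baseline) / baseline * 100.0;
}

/* Difference between two percentages, in percentage points. */
static double pp_delta(double baseline_pct, double candidate_pct) {
    return candidate_pct - baseline_pct;
}
```

For example, `pct_delta(16.2, 11.7)` is about -27.8 (wall time with ANE_ACCUM_STEPS=20), and `pp_delta(77.6, 71.7)` is about -5.9 (compile overhead).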

Findings

Configurable ACCUM_STEPS is the main win: doubling it from 10 to 20 cuts wall-clock time by ~28% (16.2s to 11.7s) by reducing exec() restarts from 3 to 2.
IOSurface pooling provides only a marginal benefit at this scale; a larger benefit is expected on longer training runs.
The env vars (ANE_ACCUM_STEPS, ANE_MAX_COMPILES) allow tuning at runtime without recompiling.


dev-erik added a commit to dev-erik/ANE that referenced this pull request Mar 4, 2026

[feat] Add IOSurface pool and env-configurable ACCUM_STEPS/MAX_COMPIL…

46319e9

…ES (upstream PR maderix#33)

alvgeppetto-debug wants to merge 1 commit into maderix:main from alvgeppetto-debug:perf/compile-io
