feat: async double-buffered ANE training (eliminates 88% compile bottleneck) #23

Open
mgkcloud wants to merge 4 commits into maderix:main from mgkcloud:pr/double-buffer
Conversation

mgkcloud commented Mar 3, 2026

Problem

Training on ANE is bottlenecked by recompilation, not compute. Because weights are baked in at compile time, every weight update forces a full ANE recompile. On M4:

  • Compile: 55ms per step
  • Training (eval): 0.54ms per step
  • 88.6% of wall time is compilation overhead

Solution

Async double-buffered compilation using GCD (Grand Central Dispatch).

Two kernel sets (A/B) alternate between active and pending:

  1. Kernel A runs training evals in the foreground
  2. Kernel B compiles in a background GCD thread with updated weights
  3. At batch boundary, atomic swap: B becomes active, A becomes pending
  4. Zero stall on kernel swap

Results (M4, 10 cores, 32GB)

| Metric | Before | After |
| --- | --- | --- |
| Compile stall per step | 55ms | 0ms |
| Sustained throughput | ~0.8 TFLOPS | 7.5 TFLOPS |
| ANE utilization | ~11% | 47% of theoretical |
| Peak (96x conv 512ch) | - | 0.43ms = 7.50 TFLOPS |

Verified over 50 training steps with 5 batches each. Every batch achieved compile_stall=0ms. Kernel swap confirmed at step boundaries.

Key findings (PROBE_RESULTS.md)

  • ANE has a per-process compile limit (~119-132 on M4). After that, exec() restart is required.
  • Weights are truly baked at compile time. unload+load after file overwrite does NOT update weights.
  • All QoS values (0-63) work with no measurable latency difference.
  • _ANEChainingRequest supports loopback execution for potential kernel fusion.

Files

  • training/train_double_buffer.m — Modified train_large.m with async compilation
  • PROBE_RESULTS.md — Full benchmark data from M4 probe experiments
  • training/Makefile — Updated with train_double_buffer target

Usage

cd training
make train_double_buffer
./train_double_buffer

Tested on Apple M4 (macOS 26.3). Should work on any Apple Silicon with ANE.

mgkcloud added 4 commits March 4, 2026 00:12
Key discovery: compile and eval can run in parallel via GCD.
119 foreground evals completed during a 26.8ms background compile.

Architecture:
- Two kernel sets (A/B) alternate active/pending
- Background GCD thread compiles pending kernels while active runs
- Atomic swap at batch boundary
- Eliminates 88% compilation bottleneck

Includes:
- train_double_buffer.m: modified train_large.m with async compilation
- PROBE_RESULTS.md: full benchmark data from M4 probe
- Updated Makefile
dev-erik added a commit to dev-erik/ANE that referenced this pull request Mar 3, 2026
…timized training (train_opt), double-buffered async ANE training (train_double_buffer), Qwen2.5-0.5B LLM inference (inference/). Added get_path() env var support and SEC_FLAGS to all new targets. Skipped PR maderix#22 (binary blob risk).