feat: async double-buffered ANE training (eliminates 88% compile bottleneck) #23

Open
mgkcloud wants to merge 4 commits into maderix:main from mgkcloud:pr/double-buffer
Conversation

mgkcloud commented Mar 3, 2026

Problem

Training on ANE is bottlenecked by recompilation, not compute. Because weights are baked in at compile time, every weight update forces a full ANE recompile. On M4:

  • Compile: 55ms per step
  • Training (eval): 0.54ms per step
  • 88.6% of wall time is compilation overhead

Solution

Async double-buffered compilation using GCD (Grand Central Dispatch).

Two kernel sets (A/B) alternate between active and pending:

  1. Kernel A runs training evals in the foreground
  2. Kernel B compiles in a background GCD thread with updated weights
  3. At batch boundary, atomic swap: B becomes active, A becomes pending
  4. Zero stall on kernel swap

Results (M4, 10 cores, 32GB)

| Metric | Before | After |
| --- | --- | --- |
| Compile stall per step | 55ms | 0ms |
| Sustained throughput | ~0.8 TFLOPS | 7.5 TFLOPS |
| ANE utilization | ~11% | 47% of theoretical |
| Peak (96x conv 512ch) | - | 0.43ms = 7.50 TFLOPS |

Verified over 50 training steps with 5 batches each. Every batch achieved compile_stall=0ms. Kernel swap confirmed at step boundaries.

Key findings (PROBE_RESULTS.md)

  • ANE has a per-process compile limit (~119-132 on M4). After that, exec() restart is required.
  • Weights are truly baked at compile time. unload+load after file overwrite does NOT update weights.
  • All QoS values (0-63) work with no measurable latency difference.
  • _ANEChainingRequest supports loopback execution for potential kernel fusion.

Files

  • training/train_double_buffer.m — Modified train_large.m with async compilation
  • PROBE_RESULTS.md — Full benchmark data from M4 probe experiments
  • training/Makefile — Updated with train_double_buffer target

Usage

cd training
make train_double_buffer
./train_double_buffer

Tested on Apple M4 (macOS 26.3). Should work on any Apple Silicon with ANE.

mgkcloud added 4 commits March 4, 2026 00:12
Key discovery: compile and eval can run in parallel via GCD.
119 foreground evals completed during a 26.8ms background compile.

Architecture:
- Two kernel sets (A/B) alternate active/pending
- Background GCD thread compiles pending kernels while active runs
- Atomic swap at batch boundary
- Eliminates 88% compilation bottleneck

Includes:
- train_double_buffer.m: modified train_large.m with async compilation
- PROBE_RESULTS.md: full benchmark data from M4 probe
- Updated Makefile
dev-erik added a commit to dev-erik/ANE that referenced this pull request Mar 3, 2026
…timized training (train_opt), double-buffered async ANE training (train_double_buffer), Qwen2.5-0.5B LLM inference (inference/). Added get_path() env var support and SEC_FLAGS to all new targets. Skipped PR maderix#22 (binary blob risk).