Description
Hardware: Apple M3 Ultra, 512 GB unified memory, Mac Studio (Mac15,14)
macOS: 15.7.4 (24G517)
Commit: efcf193
Clang: Apple clang 17.0.0
Key finding: 512ch is the only valid channel count
Every channel count tried other than 512 (for example 256, 384, 1024, 4096)
fails with -4 (invalid config) or -3 (compile failure). This is not
"minimum 512": only exactly 512 is accepted.
inmem_bench
=== In-Memory ANE Benchmark ===
Config W (MB) ms/eval TFLOPS
256ch x64sp 0.1 FAIL(-4)
512ch x64sp 0.5 0.189 ms 0.18
1024ch x64sp 2.0 FAIL(-4)
2048ch x64sp 8.0 FAIL(-4)
3072ch x64sp 18.0 FAIL(-4)
4096ch x64sp 32.0 FAIL(-4)
inmem_peak
=== Programmatic MIL → In-Memory ANE Peak ===
Config W(MB) GFLOP ms/eval TFLOPS
32x conv 512ch sp64 16.0 1.07 0.281 ms 3.82
48x conv 512ch sp64 24.0 1.61 0.315 ms 5.11
64x conv 512ch sp64 32.0 2.15 0.313 ms 6.86
96x conv 512ch sp64 48.0 3.22 0.380 ms 8.47
128x conv 512ch sp64 64.0 4.29 0.490 ms 8.77
64x conv 256ch sp64 8.0 0.54 FAIL(-3)
128x conv 256ch sp64 16.0 1.07 FAIL(-3)
256x conv 256ch sp64 32.0 2.15 FAIL(-3)
64x conv 384ch sp64 18.0 1.21 FAIL(-3)
128x conv 384ch sp64 36.0 2.42 FAIL(-3)
Peak sustained: 8.77 TFLOPS at 128x conv depth.
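The W(MB), GFLOP and TFLOPS columns above can be cross-checked with a few lines of arithmetic. This is a minimal sketch, not the benchmark's own code: it assumes 1x1 convolutions, fp16 weights (2 bytes each) and 2 FLOPs per multiply-accumulate, none of which the output states explicitly. The function name `conv_stack_stats` is hypothetical.

```python
# Assumptions (not confirmed by the benchmark output): 1x1 convolutions,
# fp16 weights (2 bytes each), 2 FLOPs per multiply-accumulate.

def conv_stack_stats(depth, channels, spatial, ms_per_eval):
    """Return (weights in MB, GFLOP per eval, TFLOPS) for a conv stack."""
    weights_mb = depth * channels * channels * 2 / 2**20
    gflop = depth * 2 * channels * channels * spatial / 1e9
    tflops = gflop / ms_per_eval  # GFLOP per millisecond equals TFLOP per second
    return weights_mb, gflop, tflops

# (depth, ms/eval) pairs taken from the successful inmem_peak rows.
rows = [(32, 0.281), (48, 0.315), (64, 0.313), (96, 0.380), (128, 0.490)]
for depth, ms in rows:
    w, g, t = conv_stack_stats(depth, 512, 64, ms)
    print(f"{depth:>3}x conv 512ch sp64  {w:5.1f} MB  {g:4.2f} GFLOP  {t:4.2f} TFLOPS")
```

Under these assumptions the recomputed values agree with the table to within the rounding of the printed ms/eval figures, which suggests the reported throughput is internally consistent.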
sram_probe
=== ANE SRAM Fine Probe (weights only vary, spatial=64) ===
Channels W (MB) ms/eval TFLOPS GFLOPS/MB
512 ch 0.5 0.173 ms 0.19 388.8
1024 ch 2.0 -4.000 ms 0.00
(all others: FAIL -4)
SRAM cliff is unmeasurable — the channel sweep is blocked by the 512ch
constraint. The benchmark's "spilling?" annotations are incorrect; these are
hard rejections, not DRAM spills.
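One way the annotation logic could separate hard rejections from possible spills is sketched below. It is illustrative only: the `annotate` helper, the `slowdown` threshold and the idea of comparing against a baseline time are assumptions, not the benchmark's actual code. Only the meanings of -3 and -4 come from the results above.

```python
# Hypothetical helper: a negative status is a rejection, never a spill.
FAILURE_LABELS = {-3: "compile failure", -4: "invalid config"}

def annotate(status, ms_per_eval=None, baseline_ms=None, slowdown=2.0):
    """Label one benchmark row; only successful runs can be flagged as spilling."""
    if status < 0:
        label = FAILURE_LABELS.get(status, "unknown error")
        return f"FAIL({status}): {label}"
    # Assumed heuristic: flag a run that is much slower than the baseline.
    if baseline_ms is not None and ms_per_eval > slowdown * baseline_ms:
        return "spilling?"
    return "ok"

print(annotate(-4))                                        # rejected config
print(annotate(0, ms_per_eval=0.173, baseline_ms=0.173))   # successful run
```

With a check like this, a row such as `1024 ch 2.0 -4.000 ms 0.00` would be reported as a failure instead of being timed and annotated.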
sram_bench
Same pattern — only 512ch succeeds. No SRAM cliff data obtainable with
current benchmark design on this hardware.
Open questions
- Whether both ANE dies (UltraFusion) are being addressed. 8.77 TFLOPS is
  consistent with a single M3 Max ANE die. If only one die is active, true
  dual-die peak may be ~17+ TFLOPS but currently unreachable(?) via _ANEClient.
- Why exactly 512ch and no other value. Possibly an UltraFusion tensor
  partitioning requirement.
- SRAM size remains unknown for M3 Ultra ANE. Measuring it would require a
  benchmark variant that sweeps spatial dimension or weight size while holding
  channels fixed at 512.
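A sweep that keeps channels at 512 could be planned along these lines. This is a rough sketch under stated assumptions: fp16 weights and activations, weight size varied through conv depth, and a hypothetical `sweep_plan` helper that only enumerates configurations and does not talk to the ANE. Whether spatial sizes other than 64 are accepted on this hardware is untested.

```python
from itertools import product

BYTES_FP16 = 2  # assumption: fp16 weights and activations

def sweep_plan(depths, spatials, channels=512):
    """Enumerate configs with fixed channels, varying depth and spatial size."""
    plan = []
    for depth, spatial in product(depths, spatials):
        plan.append({
            "depth": depth,
            "channels": channels,
            "spatial": spatial,
            # Total weight footprint of the conv stack, in MB.
            "weights_mb": depth * channels * channels * BYTES_FP16 / 2**20,
            # Size of one activation tensor, in KB.
            "activation_kb": channels * spatial * BYTES_FP16 / 1024,
        })
    return plan

for cfg in sweep_plan(depths=(1, 2, 4, 8, 16, 32, 64, 128), spatials=(64,)):
    print(cfg)
```

Stepping depth in powers of two gives weight footprints from 0.5 MB to 64 MB at 512ch, which is the range where a cliff would have to show up if one exists.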
The %peak column in inmem_peak is nonsensical (it shows 46000%+); the
denominator appears to be hardcoded to an M4 TFLOPS figure.
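The denominator implied by the %peak figure can be backed out from the reported numbers. This is a sketch that assumes the 46000%+ value belongs to the 8.77 TFLOPS row and that the column is computed as measured / reference * 100. The function name is hypothetical and the benchmark source was not inspected.

```python
def implied_reference(measured_tflops, reported_percent):
    """Reference value that would produce reported_percent for measured_tflops."""
    return measured_tflops / (reported_percent / 100.0)

# 46000 is a lower bound ("46000%+"), so this is an upper bound on the reference.
ref = implied_reference(8.77, 46000.0)
print(f"implied reference: {ref:.4f} (same units as the measured value)")
```

A reference of roughly 0.019 in TFLOPS units is far too small to be a real peak, so the constant may also be expressed in different units from the measured value. That is a guess until the source is checked.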
Happy to run additional experiments if the benchmark suite is extended to
handle the 512ch constraint.