M3 Ultra (512GB, Mac15,14) benchmark results — 512ch hard constraint #42

@pudepiedj

Hardware: Apple M3 Ultra, 512 GB unified memory, Mac Studio (Mac15,14)
macOS: 15.7.4 (24G517)
Commit: efcf193
Clang: Apple clang 17.0.0

system_info.txt

Key finding: 512ch is the only valid channel count

Every channel configuration other than exactly 512 fails, with either -4
(invalid config) or -3 (compile failure). The constraint is not "minimum 512":
1024, 2048, 3072 and 4096 channels are rejected as well.

inmem_bench

inmem_bench.log

=== In-Memory ANE Benchmark ===
Config          W (MB)   ms/eval     TFLOPS
256ch x64sp        0.1   FAIL(-4)
512ch x64sp        0.5   0.189 ms    0.18
1024ch x64sp       2.0   FAIL(-4)
2048ch x64sp       8.0   FAIL(-4)
3072ch x64sp      18.0   FAIL(-4)
4096ch x64sp      32.0   FAIL(-4)

inmem_peak

inmem_peak.log

=== Programmatic MIL → In-Memory ANE Peak ===
Config                  W(MB)   GFLOP   ms/eval     TFLOPS
32x conv 512ch sp64      16.0    1.07   0.281 ms    3.82
48x conv 512ch sp64      24.0    1.61   0.315 ms    5.11
64x conv 512ch sp64      32.0    2.15   0.313 ms    6.86
96x conv 512ch sp64      48.0    3.22   0.380 ms    8.47
128x conv 512ch sp64     64.0    4.29   0.490 ms    8.77
64x conv 256ch sp64       8.0    0.54   FAIL(-3)
128x conv 256ch sp64     16.0    1.07   FAIL(-3)
256x conv 256ch sp64     32.0    2.15   FAIL(-3)
64x conv 384ch sp64      18.0    1.21   FAIL(-3)
128x conv 384ch sp64     36.0    2.42   FAIL(-3)

Peak sustained: 8.77 TFLOPS at 128x conv depth.
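
For anyone cross-checking the table, the columns can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes each layer is a 1x1 convolution with equal input and output channels, fp16 weights (2 bytes each), and 2 FLOPs per multiply-accumulate. Those assumptions are mine, not taken from the benchmark source, but they do match the numbers in the log above. The function name is hypothetical.

```python
def conv_stack_stats(depth, channels, spatial, ms_per_eval, bytes_per_weight=2):
    """Recompute W(MB), GFLOP and TFLOPS for a stack of `depth` 1x1 convs.

    Assumptions (not confirmed by the benchmark source): square 1x1 convs,
    fp16 weights, 2 FLOPs per multiply-accumulate.
    """
    weights_mb = depth * channels * channels * bytes_per_weight / 2**20
    gflop = 2 * depth * channels * channels * spatial / 1e9
    tflops = gflop / ms_per_eval  # GFLOP per millisecond equals TFLOP per second
    return weights_mb, gflop, tflops


if __name__ == "__main__":
    # The "128x conv 512ch sp64" row from inmem_peak.log
    w, g, t = conv_stack_stats(depth=128, channels=512, spatial=64, ms_per_eval=0.490)
    print(f"W={w:.1f} MB  GFLOP={g:.2f}  TFLOPS={t:.2f}")
```

Under these assumptions the 128x row works out to 64.0 MB of weights and 4.29 GFLOP per evaluation, which at 0.490 ms gives the 8.77 TFLOPS reported above.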

sram_probe

sram_probe.log

=== ANE SRAM Fine Probe (weights only vary, spatial=64) ===
Channels   W (MB)   ms/eval      TFLOPS   GFLOPS/MB
512 ch        0.5   0.173 ms     0.19     388.8
1024 ch       2.0   -4.000 ms    0.00
(all others: FAIL -4)

The SRAM cliff cannot be measured, because the channel sweep is blocked by the
512ch constraint. The benchmark's "spilling?" annotations are incorrect here:
the failing rows (note the -4 error code printed in the ms/eval column) are
hard rejections, not DRAM spills.
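
To make the distinction concrete, here is a rough sketch of how a post-processing script could separate rejected rows from real timings instead of treating them as slow evaluations. It parses rows in the shape shown in sram_probe.log. The regular expression and the `classify` helper are hypothetical and are not part of the benchmark suite.

```python
import re

# Matches rows such as "512 ch 0.5 0.173 ms 0.19 388.8" (hypothetical parser).
ROW = re.compile(r"^\s*(\d+)\s*ch\s+([\d.]+)\s+(-?[\d.]+)\s*ms")


def classify(line):
    """Return (channels, status) for a probe row, or None if it is not a row."""
    m = ROW.match(line)
    if m is None:
        return None
    channels = int(m.group(1))
    ms = float(m.group(3))
    if ms < 0:
        # A negative "time" is an error code printed in the ms/eval column,
        # i.e. a hard rejection, not a slow (spilling) evaluation.
        return channels, f"rejected({int(ms)})"
    return channels, "ok"


if __name__ == "__main__":
    for row in ("512 ch 0.5 0.173 ms 0.19 388.8", "1024 ch 2.0 -4.000 ms 0.00"):
        print(classify(row))
```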

sram_bench

Same pattern: only 512ch succeeds. No SRAM cliff data can be obtained with the
current benchmark design on this hardware.

sram_bench.log

Open questions

  1. Are both ANE dies (UltraFusion) being addressed? 8.77 TFLOPS is
    consistent with a single M3 Max ANE die. If only one die is active, the
    true dual-die peak may be ~17+ TFLOPS, but it may not be reachable via
    _ANEClient at present.
  2. Why exactly 512ch and no other value? Possibly an UltraFusion tensor
    partitioning requirement.
  3. The SRAM size of the M3 Ultra ANE remains unknown. Measuring it would
    require a benchmark variant that sweeps the spatial dimension or the
    weight size while holding channels fixed at 512.
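
As a starting point for question 3, the sweep could be generated along these lines. This is a minimal sketch, not a patch against the benchmark: the function name and config fields are hypothetical, the sizes assume fp16 weights and activations with square 1x1 convs, and whether spatial values other than 64 compile on this machine is untested.

```python
from itertools import product

CHANNELS = 512        # the only channel count that compiled in this report
BYTES_PER_VALUE = 2   # assumption: fp16 weights and activations


def sweep_configs(depths, spatials):
    """Yield one hypothetical benchmark config per (depth, spatial) pair.

    Weight size grows with conv depth and activation size grows with the
    spatial dimension, while channels stay fixed at 512.
    """
    for depth, spatial in product(depths, spatials):
        yield {
            "channels": CHANNELS,
            "depth": depth,
            "spatial": spatial,
            "weights_mb": depth * CHANNELS * CHANNELS * BYTES_PER_VALUE / 2**20,
            "activations_kb": CHANNELS * spatial * BYTES_PER_VALUE / 1024,
        }


if __name__ == "__main__":
    for cfg in sweep_configs(depths=(1, 2, 4, 8, 16, 32, 64, 128), spatials=(64,)):
        print(cfg)
```

The inmem_peak rows above already cover 16 to 64 MB of weights at 512ch, so a finer sweep would mainly add the smaller depths and, if they compile, other spatial sizes.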

The %peak column in inmem_peak is also nonsensical (it shows 46000%+); the
denominator appears to be hardcoded to an M4 TFLOPS figure.
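
One way to sanity-check the %peak column is to back out the denominator the benchmark must be using. The sketch below assumes the 46000% figure belongs to the 8.77 TFLOPS row, which the log excerpt above does not confirm, and both helper names are hypothetical.

```python
def percent_of_peak(measured_tflops, peak_tflops):
    """Percentage of a reference peak, both values in TFLOPS."""
    return 100.0 * measured_tflops / peak_tflops


def implied_peak(measured_tflops, reported_percent):
    """Denominator (in TFLOPS) that would produce the reported percentage."""
    return 100.0 * measured_tflops / reported_percent


if __name__ == "__main__":
    # Assumption: the 46000% value was printed for the 8.77 TFLOPS row.
    print(f"implied denominator: {implied_peak(8.77, 46000):.4f} TFLOPS")
```

Under that assumption the implied denominator is roughly 0.019 in TFLOPS terms, which would suggest a unit or scaling problem in the constant rather than only a wrong chip figure. Checking the benchmark source would settle it.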

Happy to run additional experiments if the benchmark suite is extended to
handle the 512ch constraint.
