Skip to content

Fix compile-breaking ane_print_platform signature + hardcoded peak TFLOPS#4

Draft
codegen-sh[bot] wants to merge 2 commits intomainfrom
codegen-bot/fix-ane-compat-bugs-a7f3e1
Draft

Fix compile-breaking ane_print_platform signature + hardcoded peak TFLOPS#4
codegen-sh[bot] wants to merge 2 commits intomainfrom
codegen-bot/fix-ane-compat-bugs-a7f3e1

Conversation

@codegen-sh
Copy link

@codegen-sh codegen-sh bot commented Mar 3, 2026

Includes all changes from PR #2 (dynamic platform detection for ANE) with two bug fixes applied on top:

🔴 Bug 1: ane_print_platform() signature mismatch (compile-breaking)

ane_compat.h:190 declared static void ane_print_platform(const ANEPlatform *p), but all 14 call sites across every .m file called it with zero arguments — a hard compile error.

Fix: Changed to static void ane_print_platform(void), using &g_ane_platform internally. This matches the existing pattern of ane_mil_target() and ane_peak_tflops(). Added a lazy-init guard for safety.

🟡 Bug 2: Hardcoded peak TFLOPS in inmem_peak.m

Line 110 used tf/0.019*100 — a hardcoded 19 TFLOPS value that defeats the purpose of dynamic chip detection.

Fix: Replaced with tf/ane_peak_tflops()*100 to use the runtime-detected peak value.


Addresses bugs identified in the review of #2. Since that PR originates from a fork and the author is unavailable, this PR applies the fixes directly.


💻 View my work • 👤 Initiated by @dermitchell1993About Codegen
⛔ Remove Codegen from PR🚫 Ban action checks

codegen-sh bot and others added 2 commits March 2, 2026 23:35
Replace all hardcoded M4-only MIL program versions, iOS targets, and
TFLOPS values with runtime-detected equivalents across 12 training files
and 23 MIL programs.

Key changes:
- Add ane_compat.h: runtime chip detection (M1→M4+), per-chip MIL
  version/target selection, and dynamic peak TFLOPS lookup
- Convert all 23 inline MIL generators from hardcoded program(1.3) /
  func main<ios18> to dynamic program(%s) / func main<%s> with
  g_ane_platform.mil_program and ane_mil_target() format args
- Empty BuildInfo version strings (populated by MIL compiler at build)
- Replace hardcoded 15.8 TFLOPS divisors in tiny_train.m and
  train_large.m with ane_peak_tflops() calls
- Add ane_detect_platform() + ane_print_platform() init in all 12
  executable main() functions
- Update ane_mil_gen.h and stories_mil.h/stories_config.h helper
  headers to use dynamic platform values
- Normalize train_large.m line endings (CRLF→LF)

Supports: M1, M1 Pro/Max/Ultra, M2, M2 Pro/Max/Ultra, M3, M3 Pro/Max,
M4, M4 Pro/Max, and future Apple Silicon via safe fallback defaults.

Co-authored-by: dermitchell1993 <dmitchell1993@aliasvault.net>
…LOPS

Bug 1: ane_print_platform(const ANEPlatform *p) took a pointer parameter
but all 14 call sites passed zero arguments. Changed to void, using the
global g_ane_platform internally (matching ane_mil_target/ane_peak_tflops
pattern). Added lazy-init guard for safety.

Bug 2: inmem_peak.m used hardcoded tf/0.019*100 for peak utilization.
Replaced with tf/ane_peak_tflops()*100 to use the dynamically detected
peak TFLOPS value.
@@ -0,0 +1,224 @@
// ane_compat.h — Runtime platform detection for Apple Silicon ANE compatibility
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a slick piece of engineering — runtime chip detection, per-platform TFLOPS, dynamic MIL target selection. Genuinely well-designed infrastructure.

However, there's a critical gap: ane_compat.h dynamically selects the MIL spec version (ios16/ios17/ios18) based on macOS version, but the actual MIL body syntax in the templates still uses the M4-era shorthand scalar constructors:

string("x")      // M4+ only — older CoreML parsers reject this
bool(false)       // M4+ only
int32(1)          // M4+ only
uint64(64)        // M4+ only

On M1/M2 (macOS 13/14), even with ios16 as the target, the CoreML compiler will choke on these shorthand forms. PR #3 addressed this by converting to the canonical verbose forms:

tensor<string, []>("x")
tensor<bool, []>(false)
tensor<int32, []>(1)
tensor<uint64, []>(64)

So right now, ane_compat.h is necessary-but-not-sufficient for M1/M2. The target selection is correct; the MIL body syntax is the remaining barrier.

Also, the cast op between fp32↔fp16 still fails on M1/M2 ANE hardware regardless of syntax — PR #3 added an fp16 I/O fallback for that. This PR has no such mechanism.

static ANEPlatform g_ane_platform = {0};
static bool g_ane_platform_detected = false;

// ---- Internal helpers ----
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good fix — the original PR #2 declared ane_print_platform(const ANEPlatform *p) but every call site used ane_print_platform() with zero args. Changing to void with internal &g_ane_platform access plus lazy-init guard is the right call. Clean and consistent with ane_mil_target() and ane_peak_tflops().

One minor note: static globals in a header mean each translation unit gets its own copy of g_ane_platform and g_ane_platform_detected. Currently fine (single-file compilation units), but would silently break if the project ever links multiple .m files together — detection would run independently in each TU.

double ms=bench(c,s,d);
double tf=ms>0?gf/ms:0;
if(ms>0)printf("%-28s %6.1f %6.2f %7.3f ms %6.2f %5.1f%%\n",l,w,gf,ms,tf,tf/0.019*100);
if(ms>0)printf("%-28s %6.1f %6.2f %7.3f ms %6.2f %5.1f%%\n",l,w,gf,ms,tf,tf/ane_peak_tflops()*100);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch fixing this — tf/0.019*100 completely defeats the purpose of dynamic chip detection. Using ane_peak_tflops() here is obviously correct. 👍

@codegen-sh
Copy link
Author

codegen-sh bot commented Mar 3, 2026

Review Summary

I see what you mean — this is PR #2's ane_compat.h platform detection with two legitimate bug fixes layered on top. The fixes are correct; the bigger question is how this PR relates to PR #3.

✅ Bug Fixes (Both Correct)

  1. ane_print_platform signature — PR feat: M2 backward compatibility — dynamic ANE platform detection #2 declared it as (const ANEPlatform *p) but all 14 call sites passed zero args. Fixed to (void) with internal &g_ane_platform access. Clean.
  2. Hardcoded peak TFLOPStf/0.019*100 replaced with tf/ane_peak_tflops()*100. Obviously correct.

⚠️ The Elephant in the Room: PR #3 vs PR #4

These two PRs are competing approaches to M1/M2 compatibility that cannot coexist without reconciliation:

PR #3 PR #4
MIL spec version Hardcoded program(1.0) + ios16 Dynamic via ane_compat.h (best per platform)
MIL body syntax ✅ Canonical verbose (tensor<string, []>) ❌ Still uses M4-only shorthand (string("x"), bool(false), int32(1))
fp16 I/O fallback g_fp16_io + auto-retry on compile fail ❌ None — cast op still fails on M1/M2
Peak TFLOPS ❌ Hardcoded ✅ Dynamic per chip
Platform detection ❌ None ✅ Full chip/macOS/ANE detection

Bottom line: PR #4 does the smart thing (detect and adapt) but misses the necessary thing (canonical MIL syntax + fp16 fallback). PR #3 does the necessary thing but leaves some optimization on the table.

🎯 Recommended Path

The ideal merge would combine:

That way you get correct behavior on all Apple Silicon and the ability to use the best available MIL features per platform.

// Global platform info (set once by ane_detect_platform)
static ANEPlatform g_ane_platform = {0};
static bool g_ane_platform_detected = false;

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Architecture note worth flagging for future: static globals in a header = every .m file that #includes this gets its own independent copy of g_ane_platform and g_ane_platform_detected. Currently fine since each .m compiles as its own binary, but if anything ever links two translation units together (e.g., a shared helper .m + a main.m), ane_detect_platform() silently runs twice and each TU can diverge.

Not blocking — this project's structure makes it safe today. But a future-proof approach would be moving the globals + detect function to a single .m implementation file and keeping only declarations in the header. Worth a // NOTE: comment at minimum.

p.ane_peak_tflops = _ane_peak_tflops(p.chip);

// 4. Select MIL target based on macOS version
// - macOS 15+ (Sequoia) → ios18 + program(1.3)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The MIL target selection logic keys off macos_major alone, which is correct for the current matrix (13→ios16, 14→ios17, 15→ios18). However, this means macOS 16+ (whatever ships with M5-era hardware) will also get ios18 + program(1.3).

That's likely fine as a conservative fallback (newer CoreML should be backward-compatible with ios18 targets), but worth noting that if Apple introduces an ios19 MIL target with macOS 16, this function won't automatically pick it up. The mil_program might also bump (e.g., 1.4).

A defensive option: add // TODO: update when macOS 16 / ios19 target is known here so it doesn't silently regress.

"tensor<fp16, [1, %d, %d, %d]> k, tensor<fp16, [1, %d, %d, %d]> v) {\n"
" tensor<fp16, [1, %d, %d, %d]> att = scaled_dot_product_attention("
"query = q, key = k, value = v)[name = string(\"sdpa\")];\n"
" } -> (att);\n}\n",
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heads up — the MIL body syntax here still uses M4-era shorthand forms:

string("pt")       // should be: tensor<string, []>("pt")
int32(1)           // should be: tensor<int32, []>(1)
bool(false)        // should be: tensor<bool, []>(false)
uint64(64)         // should be: tensor<uint64, []>(64)

Even though ane_compat.h now correctly selects ios16 / program(1.0) on M1/M2 macOS, the CoreML MIL parser on those older systems rejects the shorthand string(...) / int32(...) / bool(...) scalar constructor syntax. The canonical verbose forms (tensor<string, []>(...) etc.) are required for backward compatibility.

PR #3 addressed exactly this across all MIL templates. This file (and most others in this PR) still need that syntax conversion to actually compile on M1/M2.

This is a recurring pattern across all 23 MIL programs in this PR — same issue in every file.

@codegen-sh
Copy link
Author

codegen-sh bot commented Mar 4, 2026

Review — Second Pass

The prior review nailed the macro-level analysis (PR #3 vs #4 reconciliation). I'm adding some specific findings from a deeper code read.

✅ Bug fixes: Both correct

  1. ane_print_platform(void) — Clean fix. Lazy-init guard is appropriate.
  2. tf/ane_peak_tflops()*100 — Correct replacement of hardcoded 0.019.

🟡 MIL body syntax — still the critical gap

Having now read both PRs side-by-side, I can confirm:

Syntax form PR #3 PR #4
string("x")tensor<string, []>("x")
int32(1)tensor<int32, []>(1)
bool(false)tensor<bool, []>(false)
uint64(64)tensor<uint64, []>(64)
buildInfo dict<string, string>dict<tensor<string, []>, tensor<string, []>>
fp16 I/O fallback (no cast on M1/M2)
Dynamic chip detection
Dynamic TFLOPS

Even with ane_compat.h correctly selecting ios16 / program(1.0) on M2, the MIL body will fail to parse on macOS 13/14's CoreML because of those shorthand scalar constructors. The buildInfo dict type signature is also wrong for older CoreML (dict<string, string> vs dict<tensor<string, []>, tensor<string, []>>).

🔧 Additional code-level notes

  1. static globals in header — Each .m TU gets its own copy. Safe today (single-binary compilation units), would silently diverge if two .m files ever link together. Worth a comment.
  2. MIL target logicmacos_major >= 15ios18 is correct now but should have a // TODO for macOS 16 / potential ios19 target.
  3. ane_peak_tflops() fallback — Returns 15.8 for unknown chips. No division-by-zero risk. 👍

🎯 Recommended merge strategy

The strongest path forward is combining both PRs:

  1. Merge PR fix: MIL syntax + M1/M2 backward compatibility #3 first (canonical MIL syntax + fp16 fallback = actual M1/M2 fix)
  2. Then layer PR Fix compile-breaking ane_print_platform signature + hardcoded peak TFLOPS #4's ane_compat.h on top (dynamic detection + TFLOPS = optimal per-chip behavior)
  3. Use the dynamic target/program selection to conditionally emit shorthand vs verbose syntax — i.e., on M4+/macOS 15+, the shorthand forms work fine and are more readable. On M1/M2, emit verbose forms.

That would give you both correctness and optimal adaptation.

codegen-sh bot added a commit that referenced this pull request Mar 4, 2026
Integrates both PR #3 (M1/M2 canonical verbose MIL syntax + fp16 I/O
fallback) and PR #4 (runtime chip/OS detection via ane_compat.h) into
a unified solution that works everywhere AND optimizes per-platform.

Changes across 16 files:
- Add training/ane_compat.h: runtime platform detection library
  (chip family, macOS version, MIL target selection, peak TFLOPS)
- Convert all 38 hardcoded program(1.0) -> program(%s) with
  g_ane_platform.mil_program dynamic argument
- Convert all 44 hardcoded func main<ios16> -> func main<%s> with
  ane_mil_target() dynamic argument
- Replace hardcoded 0.019 TFLOPS constant with ane_peak_tflops()
- Add #include ane_compat.h and platform init to 14 consumer files
- Preserve PR #3's fp16 I/O auto-retry mechanism for M1/M2
- Use canonical verbose buildInfo syntax (universal compatibility)

Co-authored-by: dermitchell1993 <dmitchell1993@aliasvault.net>
codegen-sh bot added a commit that referenced this pull request Mar 5, 2026
Integrates both PR #3 (M1/M2 canonical verbose MIL syntax + fp16 I/O
fallback) and PR #4 (runtime chip/OS detection via ane_compat.h) into
a unified solution that works everywhere AND optimizes per-platform.

Changes across 16 files:
- Add training/ane_compat.h: runtime platform detection library
  (chip family, macOS version, MIL target selection, peak TFLOPS)
- Convert all 38 hardcoded program(1.0) -> program(%s) with
  g_ane_platform.mil_program dynamic argument
- Convert all 44 hardcoded func main<ios16> -> func main<%s> with
  ane_mil_target() dynamic argument
- Replace hardcoded 0.019 TFLOPS constant with ane_peak_tflops()
- Add #include ane_compat.h and platform init to 14 consumer files
- Preserve PR #3's fp16 I/O auto-retry mechanism for M1/M2
- Use canonical verbose buildInfo syntax (universal compatibility)

Co-authored-by: dermitchell1993 <dmitchell1993@aliasvault.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant