Fix compile-breaking ane_print_platform signature + hardcoded peak TFLOPS by codegen-sh[bot] · Pull Request #4 · dermitchell1993/ANE

codegen-sh · 2026-03-03T06:57:52Z

Includes all changes from PR #2 (dynamic platform detection for ANE) with two bug fixes applied on top:

🔴 Bug 1: `ane_print_platform()` signature mismatch (compile-breaking)

ane_compat.h:190 declared static void ane_print_platform(const ANEPlatform *p), but all 14 call sites across every .m file called it with zero arguments — a hard compile error.

Fix: Changed to static void ane_print_platform(void), using &g_ane_platform internally. This matches the existing pattern of ane_mil_target() and ane_peak_tflops(). Added a lazy-init guard for safety.

🟡 Bug 2: Hardcoded peak TFLOPS in `inmem_peak.m`

Line 110 used tf/0.019*100 — a hardcoded 19 TFLOPS value that defeats the purpose of dynamic chip detection.

Fix: Replaced with tf/ane_peak_tflops()*100 to use the runtime-detected peak value.

Addresses bugs identified in the review of #2. Since that PR originates from a fork and the author is unavailable, this PR applies the fixes directly.

💻 View my work • 👤 Initiated by @dermitchell1993 • About Codegen
⛔ Remove Codegen from PR • 🚫 Ban action checks

Replace all hardcoded M4-only MIL program versions, iOS targets, and TFLOPS values with runtime-detected equivalents across 12 training files and 23 MIL programs. Key changes: - Add ane_compat.h: runtime chip detection (M1→M4+), per-chip MIL version/target selection, and dynamic peak TFLOPS lookup - Convert all 23 inline MIL generators from hardcoded program(1.3) / func main<ios18> to dynamic program(%s) / func main<%s> with g_ane_platform.mil_program and ane_mil_target() format args - Empty BuildInfo version strings (populated by MIL compiler at build) - Replace hardcoded 15.8 TFLOPS divisors in tiny_train.m and train_large.m with ane_peak_tflops() calls - Add ane_detect_platform() + ane_print_platform() init in all 12 executable main() functions - Update ane_mil_gen.h and stories_mil.h/stories_config.h helper headers to use dynamic platform values - Normalize train_large.m line endings (CRLF→LF) Supports: M1, M1 Pro/Max/Ultra, M2, M2 Pro/Max/Ultra, M3, M3 Pro/Max, M4, M4 Pro/Max, and future Apple Silicon via safe fallback defaults. Co-authored-by: dermitchell1993 <dmitchell1993@aliasvault.net>

…LOPS Bug 1: ane_print_platform(const ANEPlatform *p) took a pointer parameter but all 14 call sites passed zero arguments. Changed to void, using the global g_ane_platform internally (matching ane_mil_target/ane_peak_tflops pattern). Added lazy-init guard for safety. Bug 2: inmem_peak.m used hardcoded tf/0.019*100 for peak utilization. Replaced with tf/ane_peak_tflops()*100 to use the dynamically detected peak TFLOPS value.

codegen-sh · 2026-03-03T07:26:05Z

training/ane_compat.h

@@ -0,0 +1,224 @@
+// ane_compat.h — Runtime platform detection for Apple Silicon ANE compatibility


This is a slick piece of engineering — runtime chip detection, per-platform TFLOPS, dynamic MIL target selection. Genuinely well-designed infrastructure.

However, there's a critical gap: ane_compat.h dynamically selects the MIL spec version (ios16/ios17/ios18) based on macOS version, but the actual MIL body syntax in the templates still uses the M4-era shorthand scalar constructors:

string("x") // M4+ only — older CoreML parsers reject this bool(false) // M4+ only int32(1) // M4+ only uint64(64) // M4+ only

On M1/M2 (macOS 13/14), even with ios16 as the target, the CoreML compiler will choke on these shorthand forms. PR #3 addressed this by converting to the canonical verbose forms:

tensor<string, []>("x") tensor<bool, []>(false) tensor<int32, []>(1) tensor<uint64, []>(64)

So right now, ane_compat.h is necessary-but-not-sufficient for M1/M2. The target selection is correct; the MIL body syntax is the remaining barrier.

Also, the cast op between fp32↔fp16 still fails on M1/M2 ANE hardware regardless of syntax — PR #3 added an fp16 I/O fallback for that. This PR has no such mechanism.

codegen-sh · 2026-03-03T07:26:20Z

training/ane_compat.h

+static ANEPlatform g_ane_platform = {0};
+static bool g_ane_platform_detected = false;
+
+// ---- Internal helpers ----


Good fix — the original PR #2 declared ane_print_platform(const ANEPlatform *p) but every call site used ane_print_platform() with zero args. Changing to void with internal &g_ane_platform access plus lazy-init guard is the right call. Clean and consistent with ane_mil_target() and ane_peak_tflops().

One minor note: static globals in a header mean each translation unit gets its own copy of g_ane_platform and g_ane_platform_detected. Currently fine (single-file compilation units), but would silently break if the project ever links multiple .m files together — detection would run independently in each TU.

codegen-sh · 2026-03-03T07:26:33Z

inmem_peak.m

        double ms=bench(c,s,d);
        double tf=ms>0?gf/ms:0;
-        if(ms>0)printf("%-28s %6.1f  %6.2f  %7.3f ms %6.2f  %5.1f%%\n",l,w,gf,ms,tf,tf/0.019*100);
+        if(ms>0)printf("%-28s %6.1f  %6.2f  %7.3f ms %6.2f  %5.1f%%\n",l,w,gf,ms,tf,tf/ane_peak_tflops()*100);


Good catch fixing this — tf/0.019*100 completely defeats the purpose of dynamic chip detection. Using ane_peak_tflops() here is obviously correct. 👍

codegen-sh · 2026-03-03T07:26:55Z

Review Summary

I see what you mean — this is PR #2's ane_compat.h platform detection with two legitimate bug fixes layered on top. The fixes are correct; the bigger question is how this PR relates to PR #3.

✅ Bug Fixes (Both Correct)

ane_print_platform signature — PR feat: M2 backward compatibility — dynamic ANE platform detection #2 declared it as (const ANEPlatform *p) but all 14 call sites passed zero args. Fixed to (void) with internal &g_ane_platform access. Clean.
Hardcoded peak TFLOPS — tf/0.019*100 replaced with tf/ane_peak_tflops()*100. Obviously correct.

⚠️ The Elephant in the Room: PR #3 vs PR #4

These two PRs are competing approaches to M1/M2 compatibility that cannot coexist without reconciliation:

	PR #3	PR #4
MIL spec version	Hardcoded `program(1.0)` + `ios16`	Dynamic via `ane_compat.h` (best per platform)
MIL body syntax	✅ Canonical verbose (`tensor<string, []>`)	❌ Still uses M4-only shorthand (`string("x")`, `bool(false)`, `int32(1)`)
fp16 I/O fallback	✅ `g_fp16_io` + auto-retry on compile fail	❌ None — `cast` op still fails on M1/M2
Peak TFLOPS	❌ Hardcoded	✅ Dynamic per chip
Platform detection	❌ None	✅ Full chip/macOS/ANE detection

Bottom line: PR #4 does the smart thing (detect and adapt) but misses the necessary thing (canonical MIL syntax + fp16 fallback). PR #3 does the necessary thing but leaves some optimization on the table.

🎯 Recommended Path

The ideal merge would combine:

PR fix: MIL syntax + M1/M2 backward compatibility #3's canonical MIL syntax + fp16 I/O fallback (the actual compatibility fix)
PR Fix compile-breaking ane_print_platform signature + hardcoded peak TFLOPS #4's ane_compat.h runtime detection + dynamic TFLOPS (the infrastructure to be optimally compatible)

That way you get correct behavior on all Apple Silicon and the ability to use the best available MIL features per platform.

codegen-sh · 2026-03-04T02:55:24Z

training/ane_compat.h

+// Global platform info (set once by ane_detect_platform)
+static ANEPlatform g_ane_platform = {0};
+static bool g_ane_platform_detected = false;
+


Architecture note worth flagging for future: static globals in a header = every .m file that #includes this gets its own independent copy of g_ane_platform and g_ane_platform_detected. Currently fine since each .m compiles as its own binary, but if anything ever links two translation units together (e.g., a shared helper .m + a main.m), ane_detect_platform() silently runs twice and each TU can diverge.

Not blocking — this project's structure makes it safe today. But a future-proof approach would be moving the globals + detect function to a single .m implementation file and keeping only declarations in the header. Worth a // NOTE: comment at minimum.

codegen-sh · 2026-03-04T02:55:37Z

training/ane_compat.h

+    p.ane_peak_tflops = _ane_peak_tflops(p.chip);
+
+    // 4. Select MIL target based on macOS version
+    //    - macOS 15+ (Sequoia)  → ios18 + program(1.3)


The MIL target selection logic keys off macos_major alone, which is correct for the current matrix (13→ios16, 14→ios17, 15→ios18). However, this means macOS 16+ (whatever ships with M5-era hardware) will also get ios18 + program(1.3).

That's likely fine as a conservative fallback (newer CoreML should be backward-compatible with ios18 targets), but worth noting that if Apple introduces an ios19 MIL target with macOS 16, this function won't automatically pick it up. The mil_program might also bump (e.g., 1.4).

A defensive option: add // TODO: update when macOS 16 / ios19 target is known here so it doesn't silently regress.

codegen-sh · 2026-03-04T02:55:51Z

training/test_ane_causal_attn.m

+            "tensor<fp16, [1, %d, %d, %d]> k, tensor<fp16, [1, %d, %d, %d]> v) {\n"
+            "        tensor<fp16, [1, %d, %d, %d]> att = scaled_dot_product_attention("
+            "query = q, key = k, value = v)[name = string(\"sdpa\")];\n"
+            "    } -> (att);\n}\n",


Heads up — the MIL body syntax here still uses M4-era shorthand forms:

string("pt") // should be: tensor<string, []>("pt") int32(1) // should be: tensor<int32, []>(1) bool(false) // should be: tensor<bool, []>(false) uint64(64) // should be: tensor<uint64, []>(64)

Even though ane_compat.h now correctly selects ios16 / program(1.0) on M1/M2 macOS, the CoreML MIL parser on those older systems rejects the shorthand string(...) / int32(...) / bool(...) scalar constructor syntax. The canonical verbose forms (tensor<string, []>(...) etc.) are required for backward compatibility.

PR #3 addressed exactly this across all MIL templates. This file (and most others in this PR) still need that syntax conversion to actually compile on M1/M2.

This is a recurring pattern across all 23 MIL programs in this PR — same issue in every file.

codegen-sh · 2026-03-04T02:56:19Z

Review — Second Pass

The prior review nailed the macro-level analysis (PR #3 vs #4 reconciliation). I'm adding some specific findings from a deeper code read.

✅ Bug fixes: Both correct

ane_print_platform(void) — Clean fix. Lazy-init guard is appropriate.
tf/ane_peak_tflops()*100 — Correct replacement of hardcoded 0.019.

🟡 MIL body syntax — still the critical gap

Having now read both PRs side-by-side, I can confirm:

Syntax form	PR #3	PR #4
`string("x")` → `tensor<string, []>("x")`	✅	❌
`int32(1)` → `tensor<int32, []>(1)`	✅	❌
`bool(false)` → `tensor<bool, []>(false)`	✅	❌
`uint64(64)` → `tensor<uint64, []>(64)`	✅	❌
`buildInfo dict<string, string>` → `dict<tensor<string, []>, tensor<string, []>>`	✅	❌
fp16 I/O fallback (no `cast` on M1/M2)	✅	❌
Dynamic chip detection	❌	✅
Dynamic TFLOPS	❌	✅

Even with ane_compat.h correctly selecting ios16 / program(1.0) on M2, the MIL body will fail to parse on macOS 13/14's CoreML because of those shorthand scalar constructors. The buildInfo dict type signature is also wrong for older CoreML (dict<string, string> vs dict<tensor<string, []>, tensor<string, []>>).

🔧 Additional code-level notes

static globals in header — Each .m TU gets its own copy. Safe today (single-binary compilation units), would silently diverge if two .m files ever link together. Worth a comment.
MIL target logic — macos_major >= 15 → ios18 is correct now but should have a // TODO for macOS 16 / potential ios19 target.
ane_peak_tflops() fallback — Returns 15.8 for unknown chips. No division-by-zero risk. 👍

🎯 Recommended merge strategy

The strongest path forward is combining both PRs:

Merge PR fix: MIL syntax + M1/M2 backward compatibility #3 first (canonical MIL syntax + fp16 fallback = actual M1/M2 fix)
Then layer PR Fix compile-breaking ane_print_platform signature + hardcoded peak TFLOPS #4's ane_compat.h on top (dynamic detection + TFLOPS = optimal per-chip behavior)
Use the dynamic target/program selection to conditionally emit shorthand vs verbose syntax — i.e., on M4+/macOS 15+, the shorthand forms work fine and are more readable. On M1/M2, emit verbose forms.

That would give you both correctness and optimal adaptation.

Integrates both PR #3 (M1/M2 canonical verbose MIL syntax + fp16 I/O fallback) and PR #4 (runtime chip/OS detection via ane_compat.h) into a unified solution that works everywhere AND optimizes per-platform. Changes across 16 files: - Add training/ane_compat.h: runtime platform detection library (chip family, macOS version, MIL target selection, peak TFLOPS) - Convert all 38 hardcoded program(1.0) -> program(%s) with g_ane_platform.mil_program dynamic argument - Convert all 44 hardcoded func main<ios16> -> func main<%s> with ane_mil_target() dynamic argument - Replace hardcoded 0.019 TFLOPS constant with ane_peak_tflops() - Add #include ane_compat.h and platform init to 14 consumer files - Preserve PR #3's fp16 I/O auto-retry mechanism for M1/M2 - Use canonical verbose buildInfo syntax (universal compatibility) Co-authored-by: dermitchell1993 <dmitchell1993@aliasvault.net>

codegen-sh bot and others added 2 commits March 2, 2026 23:35

codegen-sh bot assigned dermitchell1993 Mar 3, 2026

codegen-sh bot commented Mar 3, 2026

View reviewed changes

codegen-sh bot commented Mar 4, 2026

View reviewed changes

This was referenced Mar 4, 2026

Dream merge: dynamic platform detection + canonical MIL syntax #5

Draft

fix: MIL syntax + M1/M2 backward compatibility #3

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix compile-breaking ane_print_platform signature + hardcoded peak TFLOPS#4

Fix compile-breaking ane_print_platform signature + hardcoded peak TFLOPS#4
codegen-sh[bot] wants to merge 2 commits intomainfrom
codegen-bot/fix-ane-compat-bugs-a7f3e1

codegen-sh bot commented Mar 3, 2026

Uh oh!

codegen-sh bot Mar 3, 2026

Uh oh!

codegen-sh bot Mar 3, 2026

Uh oh!

codegen-sh bot Mar 3, 2026

Uh oh!

codegen-sh bot commented Mar 3, 2026

Uh oh!

codegen-sh bot Mar 4, 2026

Uh oh!

codegen-sh bot Mar 4, 2026

Uh oh!

codegen-sh bot Mar 4, 2026

Uh oh!

codegen-sh bot commented Mar 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		@@ -0,0 +1,224 @@
		// ane_compat.h — Runtime platform detection for Apple Silicon ANE compatibility

Conversation

codegen-sh bot commented Mar 3, 2026

🔴 Bug 1: ane_print_platform() signature mismatch (compile-breaking)

🟡 Bug 2: Hardcoded peak TFLOPS in inmem_peak.m

Uh oh!

codegen-sh bot Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

codegen-sh bot Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

codegen-sh bot Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

codegen-sh bot commented Mar 3, 2026

Review Summary

✅ Bug Fixes (Both Correct)

⚠️ The Elephant in the Room: PR #3 vs PR #4

🎯 Recommended Path

Uh oh!

codegen-sh bot Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

codegen-sh bot Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

codegen-sh bot Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

codegen-sh bot commented Mar 4, 2026

Review — Second Pass

✅ Bug fixes: Both correct

🟡 MIL body syntax — still the critical gap

🔧 Additional code-level notes

🎯 Recommended merge strategy

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

🔴 Bug 1: `ane_print_platform()` signature mismatch (compile-breaking)

🟡 Bug 2: Hardcoded peak TFLOPS in `inmem_peak.m`