Fix compile-breaking ane_print_platform signature + hardcoded peak TFLOPS#4
Fix compile-breaking ane_print_platform signature + hardcoded peak TFLOPS#4codegen-sh[bot] wants to merge 2 commits intomainfrom
Conversation
Replace all hardcoded M4-only MIL program versions, iOS targets, and TFLOPS values with runtime-detected equivalents across 12 training files and 23 MIL programs. Key changes: - Add ane_compat.h: runtime chip detection (M1→M4+), per-chip MIL version/target selection, and dynamic peak TFLOPS lookup - Convert all 23 inline MIL generators from hardcoded program(1.3) / func main<ios18> to dynamic program(%s) / func main<%s> with g_ane_platform.mil_program and ane_mil_target() format args - Empty BuildInfo version strings (populated by MIL compiler at build) - Replace hardcoded 15.8 TFLOPS divisors in tiny_train.m and train_large.m with ane_peak_tflops() calls - Add ane_detect_platform() + ane_print_platform() init in all 12 executable main() functions - Update ane_mil_gen.h and stories_mil.h/stories_config.h helper headers to use dynamic platform values - Normalize train_large.m line endings (CRLF→LF) Supports: M1, M1 Pro/Max/Ultra, M2, M2 Pro/Max/Ultra, M3, M3 Pro/Max, M4, M4 Pro/Max, and future Apple Silicon via safe fallback defaults. Co-authored-by: dermitchell1993 <dmitchell1993@aliasvault.net>
…LOPS Bug 1: ane_print_platform(const ANEPlatform *p) took a pointer parameter but all 14 call sites passed zero arguments. Changed to void, using the global g_ane_platform internally (matching ane_mil_target/ane_peak_tflops pattern). Added lazy-init guard for safety. Bug 2: inmem_peak.m used hardcoded tf/0.019*100 for peak utilization. Replaced with tf/ane_peak_tflops()*100 to use the dynamically detected peak TFLOPS value.
| @@ -0,0 +1,224 @@ | |||
| // ane_compat.h — Runtime platform detection for Apple Silicon ANE compatibility | |||
There was a problem hiding this comment.
This is a slick piece of engineering — runtime chip detection, per-platform TFLOPS, dynamic MIL target selection. Genuinely well-designed infrastructure.
However, there's a critical gap: ane_compat.h dynamically selects the MIL spec version (ios16/ios17/ios18) based on macOS version, but the actual MIL body syntax in the templates still uses the M4-era shorthand scalar constructors:
string("x") // M4+ only — older CoreML parsers reject this
bool(false) // M4+ only
int32(1) // M4+ only
uint64(64) // M4+ only
On M1/M2 (macOS 13/14), even with ios16 as the target, the CoreML compiler will choke on these shorthand forms. PR #3 addressed this by converting to the canonical verbose forms:
tensor<string, []>("x")
tensor<bool, []>(false)
tensor<int32, []>(1)
tensor<uint64, []>(64)
So right now, ane_compat.h is necessary-but-not-sufficient for M1/M2. The target selection is correct; the MIL body syntax is the remaining barrier.
Also, the cast op between fp32↔fp16 still fails on M1/M2 ANE hardware regardless of syntax — PR #3 added an fp16 I/O fallback for that. This PR has no such mechanism.
| static ANEPlatform g_ane_platform = {0}; | ||
| static bool g_ane_platform_detected = false; | ||
|
|
||
| // ---- Internal helpers ---- |
There was a problem hiding this comment.
Good fix — the original PR #2 declared ane_print_platform(const ANEPlatform *p) but every call site used ane_print_platform() with zero args. Changing to void with internal &g_ane_platform access plus lazy-init guard is the right call. Clean and consistent with ane_mil_target() and ane_peak_tflops().
One minor note: static globals in a header mean each translation unit gets its own copy of g_ane_platform and g_ane_platform_detected. Currently fine (single-file compilation units), but would silently break if the project ever links multiple .m files together — detection would run independently in each TU.
| double ms=bench(c,s,d); | ||
| double tf=ms>0?gf/ms:0; | ||
| if(ms>0)printf("%-28s %6.1f %6.2f %7.3f ms %6.2f %5.1f%%\n",l,w,gf,ms,tf,tf/0.019*100); | ||
| if(ms>0)printf("%-28s %6.1f %6.2f %7.3f ms %6.2f %5.1f%%\n",l,w,gf,ms,tf,tf/ane_peak_tflops()*100); |
There was a problem hiding this comment.
Good catch fixing this — tf/0.019*100 completely defeats the purpose of dynamic chip detection. Using ane_peak_tflops() here is obviously correct. 👍
Review SummaryI see what you mean — this is PR #2's ✅ Bug Fixes (Both Correct)
|
| PR #3 | PR #4 | |
|---|---|---|
| MIL spec version | Hardcoded program(1.0) + ios16 |
Dynamic via ane_compat.h (best per platform) |
| MIL body syntax | ✅ Canonical verbose (tensor<string, []>) |
❌ Still uses M4-only shorthand (string("x"), bool(false), int32(1)) |
| fp16 I/O fallback | ✅ g_fp16_io + auto-retry on compile fail |
❌ None — cast op still fails on M1/M2 |
| Peak TFLOPS | ❌ Hardcoded | ✅ Dynamic per chip |
| Platform detection | ❌ None | ✅ Full chip/macOS/ANE detection |
Bottom line: PR #4 does the smart thing (detect and adapt) but misses the necessary thing (canonical MIL syntax + fp16 fallback). PR #3 does the necessary thing but leaves some optimization on the table.
🎯 Recommended Path
The ideal merge would combine:
- PR fix: MIL syntax + M1/M2 backward compatibility #3's canonical MIL syntax + fp16 I/O fallback (the actual compatibility fix)
- PR Fix compile-breaking ane_print_platform signature + hardcoded peak TFLOPS #4's
ane_compat.hruntime detection + dynamic TFLOPS (the infrastructure to be optimally compatible)
That way you get correct behavior on all Apple Silicon and the ability to use the best available MIL features per platform.
| // Global platform info (set once by ane_detect_platform) | ||
| static ANEPlatform g_ane_platform = {0}; | ||
| static bool g_ane_platform_detected = false; | ||
|
|
There was a problem hiding this comment.
Architecture note worth flagging for future: static globals in a header = every .m file that #includes this gets its own independent copy of g_ane_platform and g_ane_platform_detected. Currently fine since each .m compiles as its own binary, but if anything ever links two translation units together (e.g., a shared helper .m + a main.m), ane_detect_platform() silently runs twice and each TU can diverge.
Not blocking — this project's structure makes it safe today. But a future-proof approach would be moving the globals + detect function to a single .m implementation file and keeping only declarations in the header. Worth a // NOTE: comment at minimum.
| p.ane_peak_tflops = _ane_peak_tflops(p.chip); | ||
|
|
||
| // 4. Select MIL target based on macOS version | ||
| // - macOS 15+ (Sequoia) → ios18 + program(1.3) |
There was a problem hiding this comment.
The MIL target selection logic keys off macos_major alone, which is correct for the current matrix (13→ios16, 14→ios17, 15→ios18). However, this means macOS 16+ (whatever ships with M5-era hardware) will also get ios18 + program(1.3).
That's likely fine as a conservative fallback (newer CoreML should be backward-compatible with ios18 targets), but worth noting that if Apple introduces an ios19 MIL target with macOS 16, this function won't automatically pick it up. The mil_program might also bump (e.g., 1.4).
A defensive option: add // TODO: update when macOS 16 / ios19 target is known here so it doesn't silently regress.
| "tensor<fp16, [1, %d, %d, %d]> k, tensor<fp16, [1, %d, %d, %d]> v) {\n" | ||
| " tensor<fp16, [1, %d, %d, %d]> att = scaled_dot_product_attention(" | ||
| "query = q, key = k, value = v)[name = string(\"sdpa\")];\n" | ||
| " } -> (att);\n}\n", |
There was a problem hiding this comment.
Heads up — the MIL body syntax here still uses M4-era shorthand forms:
string("pt") // should be: tensor<string, []>("pt")
int32(1) // should be: tensor<int32, []>(1)
bool(false) // should be: tensor<bool, []>(false)
uint64(64) // should be: tensor<uint64, []>(64)
Even though ane_compat.h now correctly selects ios16 / program(1.0) on M1/M2 macOS, the CoreML MIL parser on those older systems rejects the shorthand string(...) / int32(...) / bool(...) scalar constructor syntax. The canonical verbose forms (tensor<string, []>(...) etc.) are required for backward compatibility.
PR #3 addressed exactly this across all MIL templates. This file (and most others in this PR) still need that syntax conversion to actually compile on M1/M2.
This is a recurring pattern across all 23 MIL programs in this PR — same issue in every file.
Review — Second PassThe prior review nailed the macro-level analysis (PR #3 vs #4 reconciliation). I'm adding some specific findings from a deeper code read. ✅ Bug fixes: Both correct
🟡 MIL body syntax — still the critical gapHaving now read both PRs side-by-side, I can confirm:
Even with 🔧 Additional code-level notes
🎯 Recommended merge strategyThe strongest path forward is combining both PRs:
That would give you both correctness and optimal adaptation. |
Integrates both PR #3 (M1/M2 canonical verbose MIL syntax + fp16 I/O fallback) and PR #4 (runtime chip/OS detection via ane_compat.h) into a unified solution that works everywhere AND optimizes per-platform. Changes across 16 files: - Add training/ane_compat.h: runtime platform detection library (chip family, macOS version, MIL target selection, peak TFLOPS) - Convert all 38 hardcoded program(1.0) -> program(%s) with g_ane_platform.mil_program dynamic argument - Convert all 44 hardcoded func main<ios16> -> func main<%s> with ane_mil_target() dynamic argument - Replace hardcoded 0.019 TFLOPS constant with ane_peak_tflops() - Add #include ane_compat.h and platform init to 14 consumer files - Preserve PR #3's fp16 I/O auto-retry mechanism for M1/M2 - Use canonical verbose buildInfo syntax (universal compatibility) Co-authored-by: dermitchell1993 <dmitchell1993@aliasvault.net>
Integrates both PR #3 (M1/M2 canonical verbose MIL syntax + fp16 I/O fallback) and PR #4 (runtime chip/OS detection via ane_compat.h) into a unified solution that works everywhere AND optimizes per-platform. Changes across 16 files: - Add training/ane_compat.h: runtime platform detection library (chip family, macOS version, MIL target selection, peak TFLOPS) - Convert all 38 hardcoded program(1.0) -> program(%s) with g_ane_platform.mil_program dynamic argument - Convert all 44 hardcoded func main<ios16> -> func main<%s> with ane_mil_target() dynamic argument - Replace hardcoded 0.019 TFLOPS constant with ane_peak_tflops() - Add #include ane_compat.h and platform init to 14 consumer files - Preserve PR #3's fp16 I/O auto-retry mechanism for M1/M2 - Use canonical verbose buildInfo syntax (universal compatibility) Co-authored-by: dermitchell1993 <dmitchell1993@aliasvault.net>
Includes all changes from PR #2 (dynamic platform detection for ANE) with two bug fixes applied on top:
🔴 Bug 1:
ane_print_platform()signature mismatch (compile-breaking)ane_compat.h:190declaredstatic void ane_print_platform(const ANEPlatform *p), but all 14 call sites across every.mfile called it with zero arguments — a hard compile error.Fix: Changed to
static void ane_print_platform(void), using&g_ane_platforminternally. This matches the existing pattern ofane_mil_target()andane_peak_tflops(). Added a lazy-init guard for safety.🟡 Bug 2: Hardcoded peak TFLOPS in
inmem_peak.mLine 110 used
tf/0.019*100— a hardcoded 19 TFLOPS value that defeats the purpose of dynamic chip detection.Fix: Replaced with
tf/ane_peak_tflops()*100to use the runtime-detected peak value.Addresses bugs identified in the review of #2. Since that PR originates from a fork and the author is unavailable, this PR applies the fixes directly.
💻 View my work • 👤 Initiated by @dermitchell1993 • About Codegen
⛔ Remove Codegen from PR • 🚫 Ban action checks