fix: MIL syntax + M1/M2 backward compatibility #3
codegen-sh[bot] wants to merge 2 commits into `main`
Conversation
Port upstream PR #6 (imperatormk) — fixes MIL scalar type syntax from the M4-only shorthand to the canonical verbose format that compiles on all Apple Silicon (M1/M2/M3/M4).

Changes:
- `program(1.3)` → `program(1.0)`, `ios18` → `ios16` target
- Scalar type shorthand → canonical verbose format
- Simplified buildInfo dict (no M4-specific version strings)
- fp16 I/O fallback: `g_fp16_io` flag with auto-retry on compile failure for M1/M2, where the cast op is unsupported
- Dynamic IOSurface byte calculation (bpe: 2 for fp16, 4 for fp32)

Tested on M1 Pro, macOS 26.3 (per upstream PR author).
```c
#include <math.h>

// Set by caller: 1 = fp16 I/O (M1/M2 fallback, no cast ops), 0 = fp32 I/O with cast (M4+)
extern int g_fp16_io;
```
🐛 Linker bug: `ane_mil_gen.h` declares `extern int g_fp16_io`, and its static functions (`mil_gen_conv`, `mil_gen_matmul`, etc.) read it. `train.m` includes this header transitively via `backward.h` → `model.h` → `ane_mil_gen.h`, but never defines `g_fp16_io`.

Since the Makefile compiles `train.m` as a single translation unit (`xcrun clang ... -o train train.m`), this will produce an undefined-symbol linker error for `_g_fp16_io`.

Fix: Add `int g_fp16_io = 0;` (non-static) at file scope in `train.m`, plus the same retry-on-compile-failure logic the other files have. Alternatively, refactor so `model.h`'s `model_compile_kernels` handles the fallback internally.
```c
static mach_timebase_info_data_t g_tb;
static double ticksToMs(uint64_t t) { return (double)t * g_tb.numer / g_tb.denom / 1e6; }
static int g_fp16_io = 0; // M1/M2: cast op unsupported, use fp16 I/O directly
```
Nit: the `static` keyword here gives `inmem_peak.m` its own private copy, which is fine since this file doesn't include `ane_mil_gen.h`. But it's worth noting the dual pattern: test files use `static int g_fp16_io` (file-local), while `ane_mil_gen.h` expects `extern int g_fp16_io` (shared). Both work in isolation — just something to be aware of for maintainability.

Longer term, it would be cleaner to pass `fp16_io` as a parameter to the MIL generation functions rather than relying on global mutable state.
```objc
[fm removeItemAtPath:td error:nil];
if (!g_fp16_io) {
    printf("[ANE] fp32 compile failed, retrying with fp16 I/O (M1/M2 fallback)\n");
    g_fp16_io = 1;
    goto retry_compile;  // regenerate MIL with direct fp16 I/O and recompile
}
```
The auto-retry with `goto retry_compile` is solid engineering — compile once with the fp32 cast path, and if it fails (M1/M2 hardware), automatically retry with direct fp16 I/O. Clean fallback mechanism. 👍

One edge case to consider: if `g_fp16_io` gets set to 1 here but a subsequent benchmark call with different params also fails, the recursive `bench(ch, sp, depth)` call will now attempt fp16 directly and return -3 on failure without resetting `g_fp16_io`. That's probably fine, since if the hardware doesn't support casts, it never will — just noting it.
Review Summary

Solid port of upstream PR #6 — the MIL syntax normalization and M1/M2 backward compatibility are well-engineered. The auto-fallback from the fp32 cast path to direct fp16 I/O is a particularly nice touch. Here's what I found:

🐛 Bug (Blocking)

`train.m` never defines `g_fp16_io` (declared `extern` in `ane_mil_gen.h`), producing an undefined-symbol linker error. Fix: Add `int g_fp16_io = 0;` at file scope in `train.m`.

📝 Suggestions (Non-blocking)
✅ What looks good
`train.m` includes `ane_mil_gen.h` (via `backward.h` → `model.h`), which declares `extern int g_fp16_io`, but `train.m` never defined it — producing an undefined-symbol linker error.

Changes:
- `train.m`: add `g_fp16_io = 0` at file scope; wrap `model_compile_kernels` with auto-retry (try fp32, on fail set `g_fp16_io = 1`, retry fp16)
- `model.h`: `compile_conv_kernel` IOSurface byte calculation now uses `g_fp16_io ? 2 : 4` (was hardcoded to 4)
- `.gitignore`: add `train` binary + test/probe binaries
✅ Fixed the blocking linker error + IOSurface byte mismatch.
Integrates both PR #3 (M1/M2 canonical verbose MIL syntax + fp16 I/O fallback) and PR #4 (runtime chip/OS detection via `ane_compat.h`) into a unified solution that works everywhere AND optimizes per-platform.

Changes across 16 files:
- Add `training/ane_compat.h`: runtime platform detection library (chip family, macOS version, MIL target selection, peak TFLOPS)
- Convert all 38 hardcoded `program(1.0)` → `program(%s)` with `g_ane_platform.mil_program` dynamic argument
- Convert all 44 hardcoded `func main<ios16>` → `func main<%s>` with `ane_mil_target()` dynamic argument
- Replace hardcoded 0.019 TFLOPS constant with `ane_peak_tflops()`
- Add `#include "ane_compat.h"` and platform init to 14 consumer files
- Preserve PR #3's fp16 I/O auto-retry mechanism for M1/M2
- Use canonical verbose buildInfo syntax (universal compatibility)

Co-authored-by: dermitchell1993 <dmitchell1993@aliasvault.net>
PR Review Summary

Overall Assessment: Excellent foundation — superseded by PR #5 (dream merge)

Strengths of PR #3
Limitation Identified
Resolution

PR #5 integrates PR #3's canonical syntax + fp16 fallback with PR #4's runtime platform detection (`ane_compat.h`).
Community-submitted results for M1 Pro/Max, M3 Pro, M4 Pro/Max, M5. Includes training performance, peak throughput, MIL compatibility matrix, and structured JSON data.
If you're so keen on 'backporting' code then see https://github.com/imperatormk/ane-train
@imperatormk Thanks for the pointer to ane-train — solid framework. Just backported three things from it into PR #8.
Also brought over your …

The full modules/ops library (backward pass + Adam optimizer on ANE) is noteworthy, but would be a separate integration effort.
Checked out imperatormk/ane-train and backported the most impactful pieces to `ane_runtime.h`:
The full ops library (backward pass + Adam on ANE, 26+ module headers) is substantial — worth a dedicated follow-up if you want to integrate it.
Summary
Ports upstream PR #6 by @imperatormk — fixes MIL scalar type syntax from M4-only shorthand to the canonical verbose format that CoreML's compiler emits, enabling compilation on all Apple Silicon (M1/M2/M3/M4).
What changed
MIL Syntax (all files)
| Before (M4-only) | After (all Apple Silicon) |
| --- | --- |
| `program(1.3)` | `program(1.0)` |
| `func main<ios18>(...)` | `func main<ios16>(...)` |
| `string("x")` | `tensor<string, []>("x")` |
| `bool(true)` | `tensor<bool, []>(true)` |
| `int32(1)` | `tensor<int32, []>(1)` |
| `fp16(0.5)` | `tensor<fp16, []>(0.5)` |
| `uint64(64)` | `tensor<uint64, []>(64)` |

fp16 I/O Fallback (`inmem_peak.m`, `training/ane_mil_gen.h`)

M1/M2 ANE hardware cannot execute the `cast` op between fp32↔fp16. This adds:
- `g_fp16_io` global flag — when set, generates MIL with direct fp16 I/O (no cast ops)
- Dynamic IOSurface byte calculation (bpe: 2 for fp16, 4 for fp32)

buildInfo

Simplified from the verbose M4-specific version dict to minimal `coremlc-version` only.

Files modified (16)

- `.gitignore` — new (build artifact exclusions)
- `inmem_peak.m` — benchmark with fp16 fallback + retry logic
- `training/ane_mil_gen.h` — MIL generation helpers (dual fp16/fp32 paths)
- `training/stories_mil.h` — model layer MIL generation (canonical syntax)

Testing

Tested on M1 Pro, macOS 26.3 (per the upstream PR author).
👤 Initiated by @dermitchell1993