ANE private API research: chaining, E5 runtime, custom MIL compilation #40
Open
dev-erik wants to merge 1 commit into maderix:main from
TL;DR
We reverse-engineered three paths to direct ANE access on macOS 15 (M4 Max):
1. _ANEChainingRequest (legacy) -- multi-kernel pipelining API. We got it to validate, but it requires Espresso IR that the in-memory MIL path cannot produce. Dead end on macOS 15+.
2. E5 runtime (MLE5Engine) -- the modern ANE execution path used by CoreML internally. We validated its behaviour and found that CoreML's MLDelegateModel caching outperforms direct engine calls.
3. Custom MIL compilation via MLE5ProgramLibraryOnDeviceAOTCompilationImpl. We verified that attention, linear layers, full transformer blocks, and backward-pass matmuls all execute correctly on ANE hardware.
Bottom line: Custom MIL compilation is the viable path for direct ANE compute. The legacy chaining API is obsolete on macOS 15+. Training on ANE is theoretically possible but impractical, because weights are read-only and every update requires recompilation (~10-50 ms).
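To make the "impractical" claim concrete, the ceiling on training steps per second can be worked out from the quoted ~10-50 ms recompile latency. This is a back-of-the-envelope sketch in Python; the 1 ms per-step compute time is a hypothetical figure chosen for illustration, not a measurement from this PR, and the function names are made up for the example.

```python
def max_steps_per_second(recompile_ms, compute_ms):
    """Upper bound on steps/s when every weight update pays one recompile."""
    return 1000.0 / (recompile_ms + compute_ms)


def recompile_fraction(recompile_ms, compute_ms):
    """Share of each step's wall-clock time spent recompiling."""
    return recompile_ms / (recompile_ms + compute_ms)


# Hypothetical forward+backward time per step (assumption, not measured).
COMPUTE_MS = 1.0

# The two ends of the ~10-50 ms recompile range quoted above.
for recompile_ms in (10.0, 50.0):
    rate = max_steps_per_second(recompile_ms, COMPUTE_MS)
    share = recompile_fraction(recompile_ms, COMPUTE_MS)
    print(f"recompile {recompile_ms:.0f} ms: "
          f"<= {rate:.1f} steps/s, {share:.0%} of time recompiling")
```

Under that assumption, recompilation accounts for more than 90% of each step even at the fast end of the range, so the compute speed of the ANE itself barely matters.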
Experiments & Results
Phase 1: ChainingRequest API (Experiments A-P)
Systematically probed 12+ private Obj-C classes to understand the _ANEChainingRequest pipeline for multi-kernel pipelining (running ANE ops back-to-back without CPU round-trips).
- _ANEChainingRequest.validate -- … _ANEBuffer (wraps IOSurface with symbolIndex) instead of _ANEIOSurfaceObject
- _ANEIOSurfaceOutputSets -- … statsSurRef
- prepareChainingWithModel: -- … _ANEModel (disk-compiled Espresso IR), crashes with _ANEInMemoryModel
- _ANEClient.evaluateRealTimeWithModel: -- … evaluateWithQoS: at small dims (64x64); no advantage at production dims (768x256)
- _ANESharedSignalEvent / _ANESharedWaitEvent -- … IOSurfaceSharedEvent, work with MTLSharedEvent
- … ANEProgramChainingPrepare Failed)
Phase 2: E5 Runtime Validation (Experiments W1-W5)
Validated the modern E5 execution path that CoreML uses internally on macOS 15+.
- MLModel -> MLDelegateModel -> MLE5Engine -> MLE5ExecutionStream -> e5rt_program_library
- MLDelegateModel caching -- … MLE5Engine.predictionFromFeatures: directly, due to internal stream/operation caching
- _executeStream: -- … MLE5ExecutionStreamOperation objects (handle=0x0), these are no-ops -- output validation is critical
- … MLE5ProgramLibrary, MLE5StaticShapeExecutionStreamOperationPool, MLE5ProgramLibraryOnDeviceAOTCompilationImpl
Phase 3: Custom MIL -> ANE Execution (Experiments X1, Y1-Y3, Z1)
Breakthrough: write MIL text, compile to e5rt_program_library, execute on ANE via MLE5Engine.
- … scaled_dot_product_attention (self-attn, 4 heads)
- … linear with embedded const weights (64x32 -> 64x16)
The compilation pipeline: MIL text -> MLE5ProgramLibraryOnDeviceAOTCompilationImpl -> createProgramLibraryHandleWithRespecialization: -> MLE5ProgramLibrary -> MLE5Engine (7-arg init) -> predictionFromFeatures:.
Additional Benchmarks
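The benchmark programs in this PR time each evaluation path at several dimension sets. A minimal sketch of that kind of harness follows, written in Python for brevity (the test programs themselves are Objective-C). The names `bench` and `make_fake_eval` and the stand-in workload are hypothetical and do not come from the PR; only the three dimension sets are the ones quoted for test_bench_paths.m.

```python
import statistics
import time

# Dimension sets quoted for test_bench_paths.m in this PR.
DIMS = [(64, 32), (256, 128), (768, 256)]


def bench(fn, warmup=3, iters=20):
    """Return the median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # warm caches before timing
    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples_ms)


def make_fake_eval(rows, cols):
    """Hypothetical stand-in for one evaluation call at the given dims."""
    data = [1.0] * (rows * cols)

    def run():
        return sum(data)

    return run


results = {dims: bench(make_fake_eval(*dims)) for dims in DIMS}
for (rows, cols), ms in results.items():
    print(f"{rows}x{cols}: median {ms:.4f} ms")
```

Taking the median rather than the mean keeps a single slow outlier (for example, a first-call cache miss that slips past warm-up) from skewing the comparison between paths.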
Files Added
Test Programs
- training/test_chaining_v2.m -- … _ANEChainingRequest and 12+ private ANE classes. Dumps methods, type encodings, properties. Tests _ANEBuffer, _ANEIOSurfaceOutputSets, _ANEProgramIOSurfacesMapper, _ANESharedSignalEvent/WaitEvent. Benchmarks standard vs RT eval paths.
- training/test_ane_model.m -- … _ANEModel factory methods, _ANECompiler compilation, prepareChainingWithModel: crash investigation, _ANEInputBuffersReady/_ANEOutputSetEnqueue type encoding, _ANEProgramForEvaluation.processRequest, shared event construction, IOSurface mapper exploration.
- training/test_coreml_chaining.m -- … MLModel compileModelAtURL:) to extract _ANEModel objects. Tests _ANEBuffer creation with symbolIndex, _ANEIOSurfaceOutputSets with stats surfaces, chaining request validation, prepareChainingWithModel: with various parameter combinations.
- training/test_e5_validate.m -- … MLE5Engine and MLE5ProgramLibrary from compiled CoreML models. Tests _executeStream: with fabricated operations. Profiles MLDelegateModel vs direct MLE5Engine. Dumps all E5 class methods and properties.
- training/test_mil_custom.m -- … compileAndCreateEngine helper (the full MIL -> ANE pipeline), findE5Container for extracting MLProgramE5Container. Runs SDPA, linear-with-weights, full transformer block, and backward pass matmul -- all verified against CPU reference implementations.
- training/test_throughput_ceiling.m
- training/test_bench_paths.m -- … evaluateWithQoS: vs RT evaluateRealTimeWithModel: vs processRequest at three dimension sets (64x32, 256x128, 768x256). Shows RT advantage disappears at production dims.
Documentation
- docs/ANE_CHAINING_RESEARCH.md
- docs/ANE_INTERNALS.md
Modified Files
- training/Makefile
- training/ane_runtime.h -- … ane_eval_rt(), a wrapper for _ANEClient.evaluateRealTimeWithModel: with fallback to standard eval
MIL Syntax Lessons Learned
These are non-obvious and not documented by Apple anywhere:
- layer_norm epsilon type must match gamma/beta dtype (fp16, not fp32)
- matmul requires both transpose_x and transpose_y as bool consts
- concat requires interleave (bool) param and axis as int32 scalar (not tensor)
- MLE5Engine uses a 7-argument initializer: initWithProgramLibrary:modelDescription:configuration:functionName:classProbabilitiesFeatureName:optionalInputDefaultValues:compilerVersionInfo:
- MLProgramE5Container can be created via initWithModelAssetPath:configuration: from a .mlmodelc path
- … ~/Library/Caches/<binary_name>/ for ANE specialization cache
- … MLModelDescription passed to MLE5Engine
- … cast ops
Build & Run
No external dependencies. System frameworks only (Foundation, CoreML, IOSurface, Metal, Accelerate). Requires macOS 15+ on Apple Silicon.
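Since the test programs only make sense on macOS 15+ on Apple Silicon, a quick pre-flight check can save a confusing build failure. The sketch below uses only Python's standard `platform` module; the helper name `meets_requirements` is hypothetical and is not part of this PR.

```python
import platform

MIN_MACOS_MAJOR = 15  # requirement stated above


def meets_requirements(system, release, machine):
    """True if the host looks like macOS 15+ running natively on arm64."""
    if system != "Darwin" or machine != "arm64":
        return False
    try:
        major = int(release.split(".")[0])
    except ValueError:
        # Empty or malformed release string (e.g. not running on macOS).
        return False
    return major >= MIN_MACOS_MAJOR


ok = meets_requirements(
    platform.system(),      # "Darwin" on macOS
    platform.mac_ver()[0],  # macOS release string, "" on other systems
    platform.machine(),     # "arm64" when running natively on Apple Silicon
)
print("requirements met" if ok else "requires macOS 15+ on Apple Silicon")
```

Note that a Python interpreter running under Rosetta reports `x86_64`, so the check is deliberately conservative in that case.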