Bypassing CoreML to natively train a 110M Transformer on the Apple Neural Engine (Orion)
It is hard to communicate how frustrating the current Apple ML stack is for low-level research.
Key Takeaways
- CoreML imposes opaque abstractions that prevent direct ANE programming and do not support on-device training.
- Despite offering up to 38 TOPS (INT8) and ~19 TFLOPS of fp16 compute, the ANE remains almost entirely unused for large language model workloads.
- Building on the foundational hardware reverse-engineering by maderix (who mapped the private API surface and benchmarked the 32 MB SRAM cliff), I bypassed CoreML to train a 110M-parameter Transformer natively on the ANE.
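The 32 MB SRAM cliff is easy to see with back-of-the-envelope arithmetic. A minimal sketch (the 110M parameter count and 32 MB figure come from the post; the bytes-per-parameter values are standard, and the `weight_bytes` helper is illustrative, not from the original code):

```python
# Back-of-the-envelope: do the 110M Transformer's weights fit in ANE SRAM?
PARAMS = 110_000_000           # model size from the post
SRAM_BYTES = 32 * 2**20        # 32 MB on-chip SRAM (the "cliff" maderix benchmarked)

def weight_bytes(params: int, bytes_per_param: int) -> int:
    """Total weight storage at a given precision."""
    return params * bytes_per_param

fp16_bytes = weight_bytes(PARAMS, 2)   # fp16: 2 bytes per parameter
int8_bytes = weight_bytes(PARAMS, 1)   # INT8: 1 byte per parameter

# Even INT8 weights exceed on-chip SRAM by roughly 3x, so weights must
# stream from DRAM and every kernel must be tiled under the 32 MB cliff.
print(fp16_bytes / 2**20, int8_bytes / 2**20, int8_bytes / SRAM_BYTES)
```

This is why the SRAM cliff dominates the design: no precision choice lets a 110M model stay resident on-chip.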
What It Means
Context
CoreML imposes opaque abstractions that prevent direct ANE programming and do not support on-device training. Despite offering up to 38 TOPS (INT8) and ~19 TFLOPS of fp16 compute, the ANE remains almost entirely unused for large language model workloads. Building on the foundational hardware reverse-engineering by maderix, who mapped the private API surface and benchmarked the 32 MB SRAM cliff, I bypassed CoreML to train a 110M-parameter Transformer natively on the ANE.
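To put the ~19 TFLOPS figure in context, a rough ceiling on training throughput for a 110M model can be estimated with the common ~6 FLOPs-per-parameter-per-token heuristic for a forward-plus-backward pass. A hedged sketch: the 6N heuristic and the 40% utilization factor are my assumptions, not measurements from the post.

```python
# Rough ceiling on fp16 training throughput for a 110M model on the ANE.
PARAMS = 110e6                  # model size from the post
PEAK_FP16_FLOPS = 19e12         # ~19 TFLOPS fp16 peak, from the post
FLOPS_PER_TOKEN = 6 * PARAMS    # ~6N FLOPs/token, forward + backward (heuristic)

def tokens_per_second(peak_flops: float, utilization: float) -> float:
    """Trainable tokens/s at a given fraction of peak compute."""
    return peak_flops * utilization / FLOPS_PER_TOKEN

ideal = tokens_per_second(PEAK_FP16_FLOPS, 1.0)       # ceiling at 100% of peak
realistic = tokens_per_second(PEAK_FP16_FLOPS, 0.4)   # at an assumed 40% utilization
print(round(ideal), round(realistic))
```

Even at a conservative utilization fraction this is real training throughput for a chip that CoreML leaves idle during training, which is the waste the post is about.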
For Builders
CoreML imposes opaque abstractions that prevent direct ANE programming and do not support on-device training.