Bypassing CoreML to natively train a 110M Transformer on the Apple Neural Engine (Orion)
It is hard to communicate how frustrating the current Apple ML stack is for low-level research.
Key Takeaways
- CoreML imposes opaque abstractions that prevent direct ANE programming and do not support on-device training.
- Despite offering up to 38 TOPS (INT8) and ~19 TFLOPS of fp16 compute, the ANE remains almost entirely unused for large language model workloads.
- Building on the foundational hardware reverse-engineering by maderix (who mapped the private API surface and benchmarked the 32 MB SRAM cliff), I bypassed CoreML to train a 110M-parameter Transformer natively on the ANE.
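The 32 MB SRAM cliff is easy to see with back-of-the-envelope arithmetic. A minimal sketch (the 110M parameter count and 32 MB figure come from the post; the bytes-per-parameter values are standard, and the `weight_bytes` helper is illustrative, not from the original code):

```python
# Back-of-the-envelope: do the 110M Transformer's weights fit in ANE SRAM?
PARAMS = 110_000_000           # model size from the post
SRAM_BYTES = 32 * 2**20        # 32 MB on-chip SRAM (the "cliff" maderix benchmarked)

def weight_bytes(params: int, bytes_per_param: int) -> int:
    """Total weight storage at a given precision."""
    return params * bytes_per_param

fp16_bytes = weight_bytes(PARAMS, 2)   # fp16: 2 bytes per parameter
int8_bytes = weight_bytes(PARAMS, 1)   # INT8: 1 byte per parameter

# Even INT8 weights exceed on-chip SRAM by roughly 3x, so weights must
# stream from DRAM and every kernel must be tiled under the 32 MB cliff.
print(fp16_bytes / 2**20, int8_bytes / 2**20, int8_bytes / SRAM_BYTES)
```

This is why the SRAM cliff dominates the design: no precision choice lets a 110M model stay resident on-chip.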
What It Means
Context
CoreML imposes opaque abstractions that prevent direct ANE programming and do not support on-device training. Despite offering up to 38 TOPS (INT8) and ~19 TFLOPS of fp16 compute, the ANE remains almost entirely unused for large language model workloads. Building on the foundational hardware reverse-engineering by maderix, who mapped the private API surface and benchmarked the 32 MB SRAM cliff, I bypassed CoreML to train a 110M-parameter Transformer natively on the ANE.
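To put the ~19 TFLOPS figure in context, a rough ceiling on training throughput for a 110M model can be estimated with the common ~6 FLOPs-per-parameter-per-token heuristic for a forward-plus-backward pass. A hedged sketch: the 6N heuristic and the 40% utilization factor are my assumptions, not measurements from the post.

```python
# Rough ceiling on fp16 training throughput for a 110M model on the ANE.
PARAMS = 110e6                  # model size from the post
PEAK_FP16_FLOPS = 19e12         # ~19 TFLOPS fp16 peak, from the post
FLOPS_PER_TOKEN = 6 * PARAMS    # ~6N FLOPs/token, forward + backward (heuristic)

def tokens_per_second(peak_flops: float, utilization: float) -> float:
    """Trainable tokens/s at a given fraction of peak compute."""
    return peak_flops * utilization / FLOPS_PER_TOKEN

ideal = tokens_per_second(PEAK_FP16_FLOPS, 1.0)       # ceiling at 100% of peak
realistic = tokens_per_second(PEAK_FP16_FLOPS, 0.4)   # at an assumed 40% utilization
print(round(ideal), round(realistic))
```

Even at a conservative utilization fraction this is real training throughput for a chip that CoreML leaves idle during training, which is the waste the post is about.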
For Builders
CoreML imposes opaque abstractions that prevent direct ANE programming and do not support on-device training.