More

smpanaro · 2025-07-15T15:43:00 1752594180

In practice, how often do the models use the ANE? It sounds like you are optimizing for speed which in my experience always favors GPU.

AlekseiSavin · 2025-07-15T15:57:28 1752595048

You're right, modern edge devices are powerful enough to run small models, so the real bottleneck for a forward pass is usually memory bandwidth, which defines the upper theoretical limit for inference speed. Right now, we've figured out how to run computations in a granular way on specific processing units, but we expect the real benefits to come later when we add support for VLMs and advanced speculative decoding, where you process more than one token at a time

J_Shelby_J · 2025-07-15T16:34:08 1752597248

VLMs = very large models?

mmorse1217 · 2025-07-15T16:39:16 1752597556

Probably vision language models.

smpanaro · 2025-05-03T19:54:08 1746302048

coremltools is the only way to run on ANE, so less of a trick and more of a requirement.

The tricks are more around optimizing for the hardware capabilities/constraints. For instance:

- conv2d is faster than linear (see Apple's post [0]) so you rewrite the model for that (example from the repo [1])

- inputs/outputs are static shapes, so KV cache requires some creativity (I wrote about that here [2])

- compute is float16 (not bfloat16) so occasionally you have to avoid activation overflows

[0]: https://machinelearning.apple.com/research/neural-engine-tra...

[1]: https://github.com/Anemll/Anemll/blob/4bfa0b08183a437e759798...

[2]: https://stephenpanaro.com/blog/kv-cache-for-neural-engine

thadk · 2025-05-04T13:26:31 1746365191

Sounds like M2-era onward have bfloat16: https://eclecticlight.co/2024/01/13/how-m1-macs-may-lag-behi...

anemll · 2025-05-04T14:22:11 1746368531

Yes for GPU, however ANE only supports FP16 plus integers. M4/A17 added accelerated int8 that is twice faster than FP16

smpanaro · 2025-05-03T19:14:44 1746299684

What do you mean by less wide? The main bottleneck for transformers is memory bandwidth. ANE has a much lower ceiling than CPU/GPU (yes, despite unified memory).

Chunking is actually beneficial as long as all the chunks can fit into the ANE’s cache. It speeds up compilation for large network graphs and cached loads are negligible cost. On M1 the cache limit is 3-4GB, but it is higher on M2+.

conradev · 2025-05-03T23:48:51 1746316131

I was referring to both the lower memory bandwidth and lower FLOPs. The GPU can just do… more at once? For now. Or is that changing?

I had also assumed that loading a chunk from the cache was not free because I’ve seen cache eviction on my M1, but it’s good to know that it’s no longer as big of a limitation.

also, I’m a big fan of your work! I played around with your ModernBERT CoreML port a bit ago

smpanaro · 2025-05-04T03:15:41 1746328541

For single batch inference of anything remotely LLM you'll hit the memory bound way before FLOPs, so I haven't actually looked at FLOPs much. For raw performance GPU is certainly better. ANE is more energy efficient, but you need larger batches to really benefit.

Maybe cache is the wrong word. This is a limit to how much can be mmap'd for the ANE at once. It's not too hard to hit on M1 if your model is in the GB range. Chunking the model into smaller pieces makes it more likely to "fit", but if it doesn't fit you have to unmap/remap in each forward pass which will be noticeable.

Awesome to hear about ModernBERT! Big fan of your work as well :)

anemll · 2025-05-04T14:28:48 1746368928

Right.I was thinking about it, you still need batch refill, however, Apple Core ML tools were failing for attention activations quantization. Long context, pre-fill is still compute bound.

smpanaro · 2025-05-03T19:05:32 1746299132

Not a public follow-up but the iOS 17 speech-to-text model has a clever approach to KV caching that works within the ANE’s constraints (fixed size inputs).

I wrote about it here[0] but the gist is you can have a fixed size cache and slide it in chunks with each inference. Not as efficient as a cache that grows by one each time of course.

[0]: https://stephenpanaro.com/blog/inside-apples-2023-transforme...

kamranjon · 2025-05-04T13:26:22 1746365182

Hey I just wanted to say that this is an amazing write up and I'm bookmarking your blog cause there isn't a ton of information out there about this stuff as it related to Apple hardware and you do a really great job of explaining many of the concepts that I'm wasn't already familiar with. Thank you!

smpanaro · on April 26, 2024

I bet these can all run on ANE. I’ve run gpt2-xl 1.5B on ANE [1] and WhisperKit [2] also runs larger models on it.

The smaller ones (1.1B and below) will be usably fast and with quantization I suspect the 3B one will be as well. GPU will still be faster but power for speed is the trade-off currently.

[1] 7 tokens/sec https://x.com/flat/status/1719696073751400637 [2] https://www.takeargmax.com/blog/whisperkit

anentropic · on April 26, 2024

indeed, but probably not as written currently?

i.e they would need converting with e.g. your work in more-ane-transformers

smpanaro · on Feb 21, 2024

Has perplexity fallen out of favor? I didn't see it mentioned anywhere. I tried using lm-eval for the 2B model but the results seem wrong (46.1288).

smpanaro · on Jan 6, 2024

The jump was due to them fixing a bug. There’s a footnote about it on the bottom of page 5.

In the Discord, they mentioned a TinyLLaMa v2, presumably that would have this bug (and another bug, footnote page 4) fixed.

smpanaro · on Jan 3, 2024

MobileVLM [1] is another recent small multimodal model. They trained their own 1.4B/2.7B LLaMa from scratch using RedPajama and Vicuna instead of leveraging Phi-2.

The papers only have one common benchmark (GQA, MobileVLM scores better) so hard to say how they compare otherwise.

[1] https://arxiv.org/abs/2312.16886

smpanaro · on Dec 6, 2023

> Multi-device: Operations can run on any of the supported devices (currently, the CPU and GPU).

Probably reading into this too much, but is this hinting at future Neural Engine support?

It’d be nice to access that without CoreML.

smpanaro · on Oct 12, 2023

You can do autoregressive decoding with KV caching on the Neural Engine. You have to make a bit of a trade off and use fixed size inputs [1] but the speed up over no caching is meaningful.

There's a Whisper (Encoder-Decoder) [2] implementation if you want to see it in practice. Shameless plug, but I have a repo [3] where I'm working on autoregressive text generation on the Neural Engine. I'm running gpt2-xl (1.5B params) locally with KV caching at 120ms/token (vs. 450ms without caching). Will push an update soon.

Without quantization you can't go much higher than 1.5B params on M1's Neural Engine. M2 seems to have a higher ceiling but I haven't measured. I'm optimistic (but have not tried) that the new runtime quantization added to CoreML this year will allow for larger (and maybe faster) models on both.

[1] Technically you should be able to use 1 input with an enumerated set of sizes but I haven't been able to get it to work on the Neural Engine. This would likely be even faster. [2] https://github.com/wangchou/whisper.coreml/ [3] https://github.com/smpanaro/more-ane-transformers/

cypress66 · on Oct 12, 2023

>I'm running gpt2-xl (1.5B params) locally with KV caching at 120ms/token (vs. 450ms without caching).

That seems very slow compared to llama cpp?

smpanaro · on Oct 12, 2023

Yeah, I believe it is. You trade off speed for lower power usage and CPU. 8 tokens/sec is usable though.

HN For You