A lot of trial and error. I've built graphical tools with GD in PHP; the difficult part for me was that the coordinates were inverted.
I only knew how to draw lines and pixels, but I got the job done.
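For anyone who hasn't hit this: GD-style canvases put (0, 0) at the top-left with y growing downward, so "math" coordinates have to be flipped before drawing. A minimal sketch (the `to_screen` helper is hypothetical, just to illustrate the flip):

```python
# GD-style canvases: origin is top-left, y grows downward,
# so a y-up point must be mirrored across the canvas height.
def to_screen(x, y, height):
    """Convert a y-up point to a y-down canvas point."""
    return x, height - 1 - y

# y = 0 in math coordinates lands on the bottom row of a 200px canvas.
print(to_screen(10, 0, 200))    # → (10, 199)
print(to_screen(0, 199, 200))   # → (0, 0)
```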
I remember the LinkedIn app that grabbed all the contacts from your phone and tried to add them to your network. Random people from internet deals (local craigslist) were popping up as suggestions. So strange that this was allowed.
This is exactly the problem we're trying to solve. The models themselves have gotten surprisingly capable at small sizes, Qwen3.5 4B with 262K context, LFM2 1.2B for fast tool calling, but the inference infrastructure hasn't kept up.
When people say "local AI is too slow," they usually mean the engine is too slow, not the model. A 4B model at 186 tok/s (MetalRT on M4 Max) feels genuinely responsive for interactive chat. The same model at 87 tok/s (llama.cpp) feels sluggish. Same weights, same quality, twice the speed: that's a usability cliff.
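To make the cliff concrete, here's the wall-clock difference for streaming a 500-token reply at the two throughputs quoted above (500 tokens is my own illustrative reply length, not from the benchmarks):

```python
# Time to stream a 500-token reply at each engine's measured throughput.
tokens = 500
for engine, tok_per_s in [("MetalRT", 186), ("llama.cpp", 87)]:
    print(f"{engine}: {tokens / tok_per_s:.1f}s")
# → MetalRT: 2.7s
# → llama.cpp: 5.7s
```

Under three seconds reads like a conversation; nearly six reads like waiting on a page load.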
We think the gap between cloud and on-device inference is an infrastructure problem, not a model problem. That's what we're working on.