Once a model is stable and good enough, for example Sonnet 4.6 or GPT 5.4 (or something else in the future), it can be burned into hardware like a Taalas chip, cutting the cost many times over and increasing the speed. At some point we can rely on an older model while still being productive with it.
I always wondered why the equivalent of mining ASICs didn't apply to LLM inference... now it turns out it does, and there's a company making it fast and robust!
An ASIC for bitcoin mining makes more sense, in that the algorithm is basically “set.” For LLMs, it is hard while models are still developing.
But, sounds like Taalas is trying to strike an interesting balance where they can at least spin up ASICs for new models reasonably quickly with their modular design. It’s a really interesting bet, and might pay off.
No, burning models into hardware won't make them faster or reduce the cost. It will cost way more than a GPU for similar performance. I'm not telling you why; you can go figure that out on your own.
With some research, that chip appears to cost about $300-$400 to manufacture, die only.
For an 8B parameter model.
Opus is estimated at 500B-2T parameters. At that scale you’re past reticle limits and need HBM and multi-die packaging, which means you’ve essentially built an inference ASIC (like Groq or Etched) rather than something categorically cheaper than GPUs. The “burned into silicon” advantage mostly evaporates at frontier scale.
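For a rough sense of that scale, here's back-of-envelope arithmetic (the parameter counts, precisions, and per-stack HBM capacity are my own assumptions, not published figures) on how much memory the weights alone would need and roughly how many HBM stacks that implies:

```python
# Back-of-envelope: memory for the weights alone at frontier scale vs. HBM capacity.
# Parameter counts, precisions, and the per-stack capacity are assumptions.

HBM_STACK_GB = 36  # roughly one current 12-high HBM3e stack (assumption)

def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """GB needed to hold just the weights, ignoring KV cache and activations."""
    return params_billion * bytes_per_param  # 1e9 params * bytes, divided by 1e9 bytes per GB

for params_b in (500, 1000, 2000):
    for bytes_pp, label in ((2.0, "fp16"), (1.0, "int8"), (0.5, "4-bit")):
        gb = weight_memory_gb(params_b, bytes_pp)
        print(f"{params_b:>4}B params @ {label:<5}: {gb:6.0f} GB (~{gb / HBM_STACK_GB:.0f} HBM stacks)")
```

Even at 4-bit, a 1T-parameter model is hundreds of GB of weights before any KV cache, which is why it ends up as a multi-die, HBM-heavy design rather than a single cheap chip.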
The cutting edge, max size models will likely stay in the GPU space for a long time.
But these models are not needed for most general requests.
With a fine-tuned, quantised 30B model you can serve a large portion of requests with around 32GB of RAM (rough arithmetic below).
Free users will likely only get these kinds of models.
At some point we will get these models in hardware and the cost per token will be minimal.
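A quick sketch of the arithmetic behind that 32GB claim; the overhead numbers here are assumptions that vary a lot by runtime and context length:

```python
# Rough check that a 4-bit quantised ~30B model fits comfortably in ~32 GB.
# Quantisation overhead, KV cache, and runtime overhead are illustrative assumptions.

params = 30e9
bits_per_weight = 4.5          # ~4-bit weights plus scales/zero-points (assumption)
weights_gb = params * bits_per_weight / 8 / 1e9

kv_cache_gb = 2.0              # modest context at low precision (assumption)
runtime_overhead_gb = 1.5      # framework buffers, activations (assumption)

total = weights_gb + kv_cache_gb + runtime_overhead_gb
print(f"weights ~{weights_gb:.1f} GB, total ~{total:.1f} GB")  # ~17 GB weights, well under 32 GB
```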
> With a fine-tuned, quantised 30B model you can serve a large portion of requests with around 32GB of RAM. Free users will likely only get these kinds of models.
These are exactly the kinds of models that you can easily run locally by repurposing existing hardware. Depending on how much you're willing to wait for the answer, running local even gives you strictly better outcomes for simple Q&A queries.
(Long-context and agentic use cases are admittedly much harder to fit under that model, since non-AI uses for the high-end hardware you'd realistically need for those are rather more limited, and they're hit by the ongoing hardware shortage.)
For programmers maybe.
I do this too.
But think about all the regular users out there.
Your dad and your mum, maybe even your grandparents.
This is a huge market too, and for that we can use these special chips at scale.
Does the cost scale linearly or superlinearly? What does the $300-$400 price data point tell us in relation to the parameter count?
No gotchas here. I genuinely don't know whether 8B parameters is in a zone of significantly decreasing marginal returns -- too far out of my knowledge area, but genuinely curious.
Die size increases cost roughly exponentially: bigger dies mean fewer chips per wafer, and yield drops off sharply with die area (rough sketch below).
I expect that this kind of burned-in model is also very difficult to verify (how do you know if some of the weights are off), and not amenable to partial disablement to increase yield. For CPUs, you just laser disable bad cores. Can't forego part of a neural net.
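A minimal sketch of that cost scaling, assuming a simple Poisson yield model; the wafer cost and defect density are made-up illustrative numbers, not anything specific to Taalas:

```python
# Rough sketch of why bigger dies cost disproportionately more per good die.
# Wafer cost and defect density are illustrative assumptions.

import math

WAFER_COST_USD = 20_000                            # assumed leading-edge 300 mm wafer cost
WAFER_AREA_MM2 = math.pi * (300 / 2) ** 2 * 0.85   # usable area, ~15% edge loss (assumption)
DEFECT_DENSITY = 0.1                               # defects per cm^2 (assumption)

def cost_per_good_die(die_area_mm2: float) -> float:
    dies_per_wafer = WAFER_AREA_MM2 / die_area_mm2
    # Poisson yield: probability a die catches zero killer defects
    yield_frac = math.exp(-DEFECT_DENSITY * die_area_mm2 / 100)
    return WAFER_COST_USD / (dies_per_wafer * yield_frac)

for area in (100, 400, 800):   # mm^2; ~800 mm^2 is near the reticle limit
    print(f"{area:>4} mm^2 die: ~${cost_per_good_die(area):,.0f} per good die")
```

With these numbers the 800 mm^2 die costs roughly 16x the 100 mm^2 one despite being only 8x larger, which is the superlinear effect in question.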
You can ablate surprisingly large chunks of a model with next to no effect; you can try this easily by downloading an open-weight model in torch (sketch below).
Obviously it's not ideal, but you could likely have a single-digit percentage of all weights affected and still have a useful model (many caveats here: e.g. locality of damaged weights matters, distribution of errors matters, fail high/low matters, …)
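A minimal sketch of that experiment, using gpt2 via transformers purely because it is small and ungated; the ablation fraction and prompt are arbitrary choices:

```python
# Zero out a random fraction of weights in a small open model and check that it
# still produces coherent text. Fraction and prompt are arbitrary for illustration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ablate_fraction = 0.02  # ~2% of all weights, picked arbitrarily
with torch.no_grad():
    for p in model.parameters():
        mask = torch.rand_like(p) < ablate_fraction
        p[mask] = 0.0

inputs = tok("The capital of France is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=10, do_sample=False)
print(tok.decode(out[0]))  # usually still completes sensibly at low ablation rates
```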
I mean, you probably can just turn off defective parts of the network. You better believe if this becomes popular they would salvage yields by selling "dumber" chips at a discount.
There are a lot of tradeoffs to play with: those inference ASICs may not carry the gradient, but they are still optimised for larger batches and to run any model. They need enough memory for the weights, wide-batch inference, and ideally leftovers for KV cache efficiency (rough arithmetic below).
For personal inference you're given a lot more room to play in - much of it poorly explored today - enough to push back on the argument that the cost advantages evaporate.
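Rough KV-cache arithmetic behind that batching point; the model shape here is an assumed Llama-style 30B-class configuration, not any specific chip or model:

```python
# KV cache footprint vs. batch size for an assumed grouped-query-attention model.
# Layer count, KV head count, and head size are illustrative assumptions.

n_layers = 48
n_kv_heads = 8          # grouped-query attention (assumption)
head_dim = 128
bytes_per_elem = 2      # fp16/bf16 cache

def kv_cache_gb(batch: int, context_len: int) -> float:
    # factor of 2 for keys and values
    elems = 2 * n_layers * n_kv_heads * head_dim * context_len * batch
    return elems * bytes_per_elem / 1e9

for batch in (1, 16, 64):
    print(f"batch {batch:>2}, 8k context: ~{kv_cache_gb(batch, 8192):.1f} GB of KV cache")
```

Single-stream personal inference needs a couple of GB of cache; wide-batch serving needs tens to hundreds, which is where the memory budgets diverge.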
I just tried chatjimmy.ai for a bit and while it is absolutely blazingly fast, it's also not a very strong model. I suppose that with time, stronger models will be able to run on such hardware, too.
Oh wow! So they make dummy hospitals and put dummy meat bags of all sizes in them for camera time and social media posts, just to make Israel look bad when they hit those meat bags. That is some strategy.
Nobody said they are dummy hospitals. They are dual use, some medical, some military HQ. And nobody said they were dummy meat bags. The most powerful weapon the terrorists have is dead civilians. And you get what you reward: punish Israel for dead civilians, you'll get more dead civilians.
How is this one better? I thought this was going to be a visual editor where you click and edit on the diagram itself. I don't seem to be able to do that here.
Thank you very much for sharing this article. I have been having issues with my second monitor, which is connected to my laptop, making it 3 screens. It was very annoying having to replug it into the dock every time it decided to turn off. I have also been feeling less productive for quite a while now.
After reading this, I have let the second one stay off and then unplugged it, and I can already notice a big difference. I didn't switch between apps or procrastinate as much. It's only been a day or two and I have yet to see how I fare in the long term. For now, I am happy.