Have you benchmarked against other 3-bit dynamic quants like Unsloth? I am sorry but this framing against a full precision, newer, smaller MoE just seems misleading. Also, Gemma-4-26B-A4B is not the SOTA for edge. Even at launch, that would be the 31B.
I can't find it. Can you state your performance versus comparable 3-bit quantization from Unsloth/Bartowski? Edit: I appreciate that you seem to have open-sourced the quantization pipeline. This is not to question your work, but to understand where the outputs stand relative to the SoTA for quantization.
Anthropic better get that IPO out soon. Their incredible revenue run-up was basically a result of botched Gemini releases and OpenAI having their hands tied behind their Azure backs.
Anthropic models were quite literally the only viable serverless API (i.e. Bedrock) models on AWS. They didn't even bother releasing the recent Qwen 3.5/3.6 series. Combined with the token efficiency/ROI focus, I would really like to see how Anthropic ends Q3.
Are you comparing single-user requests or multiple concurrent requests when you say comparable to rented GPU? Most of the cost efficiencies kick in with concurrent/batch requests. A single H100 node can provide like 5k input + 2k output tok/s on a model like Qwen 3.6 35B-A3B with 30+ concurrent requests.
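To make the concurrency point concrete, here is a rough sketch of how aggregate throughput turns into cost per million tokens. The throughput is the ballpark figure above; the hourly rental price is a made-up placeholder, not a quoted rate, so plug in your own.

```python
def cost_per_million_tokens(hourly_price_usd: float, tokens_per_second: float) -> float:
    """Rental cost per one million tokens at a sustained aggregate throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000


# Ballpark aggregate throughput from the comment: 5k input + 2k output tok/s.
AGGREGATE_TOK_PER_SEC = 5_000 + 2_000

# Hypothetical rental price, chosen only for illustration.
ASSUMED_HOURLY_PRICE_USD = 2.50

batched = cost_per_million_tokens(ASSUMED_HOURLY_PRICE_USD, AGGREGATE_TOK_PER_SEC)
print(f"batched: ${batched:.3f} per million tokens")
```

The hourly price is fixed, so cost per token scales inversely with throughput: a single-user run that reaches a tenth of the batched throughput costs ten times as much per token on the same hardware.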
Let's say the government can't care for 100M people because of lack of doctors. Now they could train one over 10 years, or you could have one of the smartest doctors in the world come be 100M+1. Would you take that?
Now expand that across socio-economic spectrum (not enough plumbers, teachers, AI experts, researchers etc). That is what legal immigration is meant for.
But if the justification for immigration is prior immigration, is there a stopping point here? Like, after you import a bunch of doctors, is it going to turn out that now you need a bunch of fast food workers, back and forth?
>Let's say the government can't care for 100M people because of lack of doctors. Now they could train one over 10 years, or you could have one of the smartest doctors in the world come be 100M+1. Would you take that?
But that is not what usually happens, right? What usually happens is that some hospital employs a doctor educated in another country with lower educational standards, instead of someone educated at domestic institutions, because they will accept working for a tenth of the salary. In this case both US society and the US-educated doctor lose, and the US hospital and the migrant gain.
Feel free to expand this across socio-economic spectrum..
>Let's say the government can't care for 100M people because of lack of doctors.
Then the government is proven to be severely incompetent and shouldn't be trusted with more migration, because it is guaranteed to fumble that too. Barring mass migration, populations don't naturally explode overnight so that you suddenly end up with 100 million people and no doctors.
Governments have all the tools and data at their disposal to see population trends, the population pyramid, emigration, immigration, job statistics, housing, etc. They can use all this data to project how many doctors will be needed as the population follows its trajectory, and plan the training and recruitment of doctors ahead of time, so that when the population reaches 100 million or 500 million there will be a proportional number of doctors.
So then why didn't the government do this preemptively when they had all the info and levers at their disposal? Could it be because they simply don't give a shit and they only care about winning the next election and not what happens in 20+ years when the population reaches 100 million and there's no doctors? Because they won't be in charge then when the shit hits the fan so they don't care to be preemptive for something that's not a pressing issue now. So then given this, why would you trust these same people with enabling mass migration on your behalf? They clearly don't care about the long term future planning and second order effects of their actions.
> or you could have one of the smartest doctors in the world come be 100M+1. Would you take that?
In which case do 1 million doctors, and only doctors and nothing else but doctors, show up at your borders? Because if that were the case, I guarantee you everyone would take them in, no questions asked.
That's the classic bait and switch. Merkel also told Germans they were getting "doctors and engineers" in 2015, and the only things that increased were sexual assault rates, crime and welfare spending, to the point where "doctors and engineers" became a meme phrase for migrant crime in the news.
>That is what legal immigration is meant for.
In theory yes, but just like in Germany, in practice the system has always been abused to dupe voters into accepting anything other than doctors, so that corporations can get cheap labor and landlords more tenants. The Overton window has gotten so bad on this topic that if you complain about migrant crime, they'll maliciously ask you back "but what about doctors, you don't want them either?". No, we want doctors, we just want the doors shut to people who aren't doctors, it's really that simple.
Nobody in government wants to listen. They're too afraid of being called "far right" if they give airtime to people wanting less migration, so that only breeds resentment and a rise of the actual far right. It becomes a self-fulfilling prophecy that could be avoided if they just enabled discourse on things they don't like to hear but that are pressing issues for a large voter base.
>Seems like your issue isn't immigration, it is abuse.
Because governments don't make the distinction between the two: you give them an inch, they take a mile. If you let them enable any migration, over time they will abuse it to flood the market with cheap labor as their corporate lobbyists push them to, like they did with H1B since 1990, which was initially a scheme to import top Soviet scientists after the USSR collapsed (kind of like Operation Paperclip) and is now used for US companies to import "Microsoft Certified Specialists" from India.
If you want to stop the abuse, you have to stop migration completely and then start political negotiations from that point of leverage, on a controlled points-based migration system that accounts for actual shortages and domestic public resources.
Subjective, but if we compare to compute not everyone needs the most expensive laptops or super computers for their work.
I think frontier models will be invaluable for scientific research, defense, financial analysis and such. But the average person probably would be reasonably well-served with a local model.
If you're in sales, customer service, product management and such - the leading open models at the 30B mark are already good enough.
The title is Apple Silicon costs LESS than OpenRouter. Not sure why it got updated to this - maybe because I referenced the original HN post?
Here's the full post:
TL;DR: When you account for batching, caching and input tokens, together with the residual value of a MacBook Pro, local inference is actually 14% cheaper than OpenRouter. This becomes a whopping 3x (i.e. 65%) cheaper if you consider MoE models like Gemma 4 26B.
There was a well-meaning post yesterday by @DataDrivenAngel comparing costs of self-hosting LLMs v/s using OpenRouter (HN link). The analysis however had a few flaws as pointed out by the HN community, and I ran benchmarks on my M4 Max 128GB to adjust for those.
1. The estimate was based entirely on output tokens, instead of a real-world input-output token mix. The numbers look very different if you consider a 4:1 or 5:1 input to output token ratio.
2. Batching/concurrency/caching improves token throughput, and if you're running multiple coding agents/work trees the performance gain can be significant.
3. A MacBook Pro is an asset purchase, and retains significant residual value through its life. Probably not unreasonable to expect ~1.5-2.5k resale value after 3-5 years of use.
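The three adjustments above can be folded into one back-of-the-envelope formula: depreciation plus electricity, divided by the tokens served over the period. This is only a sketch of that arithmetic, not the spreadsheet behind the numbers in this post. Every input in the example call is a hypothetical placeholder except the resale value, which is the midpoint of the range mentioned above.

```python
def local_cost_per_million(
    purchase_usd: float,
    resale_usd: float,
    years: float,
    power_watts: float,
    usd_per_kwh: float,
    tokens_per_second: float,
    utilization: float = 1.0,
) -> float:
    """Blended local cost per million tokens (input + output) over the holding period."""
    busy_hours = years * 365 * 24 * utilization
    depreciation = purchase_usd - resale_usd           # the asset keeps some residual value
    energy = power_watts / 1000 * busy_hours * usd_per_kwh
    million_tokens = tokens_per_second * 3600 * busy_hours / 1_000_000
    return (depreciation + energy) / million_tokens


# Hypothetical inputs for illustration only (not the benchmark results in this post).
example = local_cost_per_million(
    purchase_usd=5000,      # assumed purchase price
    resale_usd=2000,        # midpoint of the ~1.5-2.5k resale range
    years=5,
    power_watts=100,        # assumed average draw under load
    usd_per_kwh=0.15,       # assumed electricity price
    tokens_per_second=500,  # assumed blended throughput with batching and caching
)
print(f"${example:.3f} per million tokens")
```

Throughput sits in the denominator, so counting input tokens and concurrency moves the result far more than small changes to the power price or the resale value.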
I ran vllm bench using a reasonable approximation of a coding agent workload with concurrency 4 for Gemma 4 31B (same as the original post), and got the following results:
Scenario    Local $/M tokens    Result vs OpenRouter
3 years     $0.15               Local cheaper (~6%)
5 years     $0.14               Local cheaper (~13%)
7 years     $0.13               Local cheaper (~19%)
-----------------------------------
Once you work out the math (using original assumptions on power costs and 5 year timeline), you get to a blended cost of ~$0.14 per million tokens for local, v/s ~$0.16 for OpenRouter. That is not a massive win. But it is close enough to flip the narrative from local being more expensive to 'it depends'.
But it doesn't end there. If you used an MoE model like Gemma 4 26B, the blended cost drops to $0.038 per million tokens, v/s OpenRouter's $0.1 per million. That is a ~3x difference.
Scenario    Local $/M tokens    Result vs OpenRouter
3 years     $0.040              Local cheaper (~60%)
5 years     $0.038              Local cheaper (~62%)
7 years     $0.035              Local cheaper (~65%)
-----------------------------------
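For anyone who wants to check the percentages, this is the arithmetic behind the "Local cheaper" column, using only the rounded per-million prices quoted in this post. Because the inputs are rounded, the results can differ from the table by a point or so.

```python
def savings_pct(local_usd: float, hosted_usd: float) -> float:
    """Percentage by which local is cheaper than the hosted price."""
    return (hosted_usd - local_usd) / hosted_usd * 100


# Rounded blended prices per million tokens, as quoted in the post.
OPENROUTER = {"Gemma 4 31B": 0.16, "Gemma 4 26B": 0.10}
LOCAL = {
    "Gemma 4 31B": {"3 years": 0.15, "5 years": 0.14, "7 years": 0.13},
    "Gemma 4 26B": {"3 years": 0.040, "5 years": 0.038, "7 years": 0.035},
}

for model, scenarios in LOCAL.items():
    for horizon, local_price in scenarios.items():
        pct = savings_pct(local_price, OPENROUTER[model])
        print(f"{model:12s} {horizon:8s} local cheaper by {pct:.1f}%")
```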
This is not meant as an attack on the original analysis - I am sure the synthetic bench I used has a few holes, plus buying price/residual value varies a fair bit. Plus, I don't think anybody will run their MBP for inference for 5 years straight.
But with worsening GPU supply and the inevitable price/access squeeze, I think local LLMs have a huge role to play. And this is on top of the privacy benefits. A misperceived price differential should not be the reason that slows down adoption.
The tweet does not make clear what the power cost assumptions are. That is wildly variable and important! Local may be cheaper for some people, perhaps not for others.
For future reference, after posting you have a couple minutes to undo any of those auto-shenanigans. I always check, and undo whatever appears to be a silly regex.
YC is super AI-forward regarding request for startups, so it feels like about time that this became an LLM-based thing. LLMs do have their uses.
note: would be hilarious if this was the result of an LLM fail, using a new system. I am a regex muggle, but could a normal regex implementation have even fumbled like that?