I've spent probably over 100 hours working on this benchmarking/site platform, and all tests are manually written. For me (and the many others who have reached out to me), they are not useless either. I use this myself regularly when choosing and comparing new models. I honestly believe it is providing value to the conversation.
Let me know if you know of a better platform you can use to compare models, I built this one because I didn't find any with good enough UX.
Yeah, but actually that's not a good look. Anyone who's used Gemini will know how random it is in terms of getting anything serious done, compared to the rock-solid Opus experience.
Their benchmark is chock-full of things like that: it's deeply flawed and is essentially rating how LLMs perform if you exert yourself trying to hold them entirely the wrong way.
I hope that Anthropic continues to do well and coding agents in general continue to progress... but I also hope Claude Code implodes dramatically and completely so we can get a ground-up rebuild with sound engineering.
Every week it seems like we're getting closer.
Bonus: a high-profile case might put an end to people fixating on how long they can go without writing any code. Which makes about as much sense as a mechanic fixating on how long they go between snapped bolts without a torque wrench.
> A $30/month subscription is indeed too much, but I see it as a one time payment for that month when I release something, then I pause the subscription. I need it rarely, very few videos need zooming and motion.
If I think something is worth the money, I typically don't need to actively decide to pause the subscription each time I use it.
Right, it’s not worth $30/month all year for me because I only use it for demo videos when I publish a new app or a large update, which happens rarely.
But if I were the kind of user who did demos monthly, the time saved on one or two videos that month would be worth $30.
The commenter you're replying to said he needs it only occasionally. It makes perfect sense to pause a subscription if you don't use it; not doing so would be a waste of money. How can you criticise that? Don't be ridiculous.
You can use this model for about 5 seconds and realize its reasoning is in a league well above any Qwen model, but instead people assume benchmarks that are openly getting used for training are still relevant.
Definitely have to use each model for your use case personally, many models can train to perform better on these tests but that might not transfer to your use case.
Did you really mean to say 4.5? GPT-4.5 used to cost $75/$150 per million tokens input/output, and it did not even seem good enough to justify that. I would not expect many people were using it, and I doubt that "expanding to India" was what killed it (if it was that useful/popular, they would have kept the API, or kept it for higher-end subscriptions).
If anything, it should have been No. 1 on the "OpenAI graveyard" website.
India in this context is a synecdoche for scaling consumer vs Anthropic's more enterprise-y route, but yes that's pretty much why we didn't get 4.5 with reasoning. Without reasoning, 4.5 had no future.
From Sam Altman himself:
> We had this big GPU crunch. We could go make another giant model. We could go make that, and a lot of people would want to use it, and we would disappoint them. And so we said, let’s make a really smart, really useful model, but also let’s try to optimize for inference cost. And I think we did a great job with that.
4.5 scaled into a unified reasoning model would have been an incredible model. It beat GPT-5 on accuracy and hallucinations without reasoning (!)
It just wouldn't have worked for powering things like ChatGPT Go's rollout and loginless chatgpt.com, so they dropped it.
(And if you want, you could argue it's the compute crunch that didn't let them do both... but Anthropic had to make the same choices at the time and went in the other direction.)
This all sounds like pure speculation to me. GPT-4.5 was OK but not spectacular. The whole marketing was based on "vibes" and how interacting with it "felt more natural", etc. If there was an actual use case for this model, I do not see why it would not just have been offered in higher-end subscriptions or through the API. Other expensive models at the time, e.g. o1/o3-pro, were not served in the free tier but only in paid subscriptions and the API; those did have use cases, so they were kept around at the time, until they were probably folded into a more unified model lineup. So I do not see why they could not have done something similar with 4.5 if it was an actually good model.
And I am not sure that Altman's statements are worth taking into account. His statements are more about marketing and turning things in his favour than about speaking the truth.
But their tab complete situation is abysmal, and Supermaven got macrophaged by Cursor