
I still debate how much productivity I've gained from better AI compared to what I lost by switching off WebStorm.

But their tab-complete situation is abysmal, and Supermaven got macrophaged by Cursor.


Lots of people have tried this; most recently, TikTok is trying to become the TikTok for games by showing zero-install games in the feed.

Is there really no rule that discourages 99% of your interactions on HN from being you peddling some useless slop benchmark?

If it's relevant to the discussion, I hope not.

I've spent probably over 100 hours working on this benchmarking/site platform, and all tests are manually written. For me (and many others who reached out to me) they are not useless either. I use this myself regularly when choosing and comparing new models. I honestly believe it is providing value to the conversation.

Let me know if you know of a better platform for comparing models; I built this one because I didn't find any with good enough UX.


It's a great benchmark. Don't listen to the haters. This one is especially interesting.

https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-med...


This one's even more interesting

https://aibenchy.com/compare/anthropic-claude-opus-4-6-mediu...

Who knew Anthropic was this far behind???


Yeah, but actually that's not a good look. Anyone who's used Gemini will know how random it is at getting anything serious done, compared to the rock-solid Opus experience.

Their benchmark is chock-full of things like that: it's deeply flawed, and it essentially rates how LLMs perform when you go out of your way to hold them entirely the wrong way.

I hope that Anthropic continues to do well and coding agents in general continue to progress... but I also hope Claude Code implodes dramatically and completely so we can get a ground-up rebuild with sound engineering.

Every week it seems like we're getting closer.

Bonus: a high-profile case might end people's fixation on how long they can go without writing any code, which makes about as much sense as a mechanic fixating on how long they can go between snapped bolts while refusing to use a torque wrench.


> A $30/month subscription is indeed too much, but I see it as a one time payment for that month when I release something, then I pause the subscription. I need it rarely, very few videos need zooming and motion.

If I think something is worth the money, I typically don't need to actively decide to pause the subscription each time I use it.


Right, it’s not worth $30/month all year for me because I don’t use it past demo videos for when I publish a new app or large update, which happens rarely.

But if I were the kind of user who did demos monthly, the time saved on one or two videos that month would be worth $30.
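
As a rough sketch of that break-even math (the release count below is a made-up assumption, not something stated in the thread):

    # Hypothetical figures: a $30/month tool, paused in months
    # with no release.
    monthly_price = 30
    release_months = 3                                 # assumed: ~3 releases/year
    year_round = monthly_price * 12                    # $360 if never paused
    pay_per_release = monthly_price * release_months   # $90
    print(year_round - pay_per_release)                # 270 saved/year by pausing

Pausing only wins while the value per active month stays above $30; a monthly publisher crosses that line and might as well keep the subscription running.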


The commenter you're replying to said he needs it only occasionally. It makes perfect sense to pause a subscription you aren't using; not doing so would be a waste of money. How can you criticise that? Don't be ridiculous.

Well, specifically, a congressperson got it to hallucinate stuff about them and then wrote an angry letter.

But I checked and it's there... though in the UI, web search can't be disabled (presumably to avoid another egg-on-face situation).


It does not matter at all, especially when talking about Qwen, who've been caught making questionable benchmark claims multiple times.

Benchmarks are a pox on LLMs.

You can use this model for about 5 seconds and realize its reasoning is in a league well above any Qwen model, but instead people assume benchmarks that are openly getting used for training are still relevant.


They really are. Benchmaxxing is real… but the Qwen 3.5 series of models is still very impressive. I’m looking forward to trying out Gemma.

You definitely have to try each model on your own use case; many models can be trained to perform better on these tests, but that might not transfer.

Except they already did this: if they had scaled 4.5 with RL, 5 would probably have been the leap we expected

If anything, 4.5 being abandoned so they could sell India a $3-a-month subscription was the first crack in The Box.


Did you really mean to say 4.5? GPT-4.5 used to cost $75/$150 per million input/output tokens, and it did not even seem good enough to justify that. I would not expect many people were using it, and I doubt that "expanding to India" was what killed it (if it was that useful/popular, they would have kept the API, or kept it for higher-end subscriptions).

If anything, it should have been No. 1 on the "OpenAI graveyard" website.


India in this context is a synecdoche for scaling the consumer business vs. Anthropic's more enterprise-y route, but yes, that's pretty much why we didn't get 4.5 with reasoning. Without reasoning, 4.5 had no future.

From Sam Altman himself:

> We had this big GPU crunch. We could go make another giant model. We could go make that, and a lot of people would want to use it, and we would disappoint them. And so we said, let’s make a really smart, really useful model, but also let’s try to optimize for inference cost. And I think we did a great job with that.

4.5 scaled into a unified reasoning model would have been an incredible model. It beat GPT-5 on accuracy and hallucinations without reasoning (!)

It just wouldn't have worked for powering things like ChatGPT Go's rollout and loginless chatgpt.com, so they dropped it.

(And if you want, you could argue it's the compute crunch that didn't let them do both... but Anthropic had to make the same choices at the time and went in the other direction.)


This all sounds like pure speculation to me. GPT-4.5 was OK but not spectacular; the whole marketing was based on "vibes" and how interacting with it "felt more natural". If there was an actual use case for this model, I do not see why it would not just have been offered in higher-end subscriptions or through the API. Other expensive models at the time, e.g. o1/o3 pro, were not served in the free tier but only in paid subscriptions and APIs; those did have use cases, so they were kept around until OpenAI presumably moved to a more unified approach to their models. So I do not see why they could not have done something similar with 4.5 if it was an actually good model.

And I am not sure that Altman's statements are worth taking into account; they are more about marketing and turning things in his favour than about speaking the truth.


I suspect the GPT-5 models are sparser and/or smaller than Opus, which is why they can afford to give away so much usage.
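
To make the sparsity point concrete (a minimal sketch with invented numbers; neither lab publishes parameter counts): in a sparse mixture-of-experts model, only a few experts fire per token, so serving cost tracks active parameters rather than total parameters.

    # Illustrative only: every count here is hypothetical, not a real
    # figure for GPT-5 or Opus.
    def active_params(total, num_experts, experts_per_token, shared_frac=0.2):
        """Rough active-parameter count for a sparse MoE transformer."""
        shared = total * shared_frac                   # attention, embeddings, etc.
        expert_pool = total - shared                   # weights split across experts
        return shared + expert_pool * experts_per_token / num_experts

    dense_active = 1.0e12                              # 1T dense model: all weights active
    moe_active = active_params(1.0e12, num_experts=64, experts_per_token=4)
    print(f"{dense_active:.2e} vs {moe_active:.2e}")   # 1.00e+12 vs 2.50e+11

Per-token FLOPs scale with active parameters, so the sparse model in this sketch is roughly 4x cheaper to serve at the same total size, which is exactly the headroom you'd need to subsidize free-tier usage.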
