rafaelmn's comments

GPT is shit at writing code. It's not dumb - extra high thinking is really good at catching stuff - but it's like letting a smart junior into your codebase: it ignores all the conventions and surrounding context and just slops all over the place to get it working. Claude is just a level above in terms of editing code.

Very different experience for me. Codex 5.3+ on xhigh are the only models I've tried so far that write reasonably decent C++ (domains: desktop GUI, robotics, game engine dev, embedded stuff, general systems engineering-type codebases), and idiomatic code in languages not well-represented in training data, e.g. QML. One thing I like explicitly is that it knows better when to stop, instead of brute-forcing a solution by spamming bespoke helpers everywhere in a way no rational dev would write.

Not always, no, and it takes investment in good prompting/guardrails/plans/explicit test recipes for sure. I'm still on average better at programming in context than Codex 5.4, even if slower. But in terms of "task complexity I can entrust to a model and not be completely disappointed and annoyed", it scores the best so far. Saves a lot on review/iteration overhead.

It's annoying, too, because I don't much like OpenAI as a company.

(Background: 25 years of C++ etc.)


Same background as you, and same exact experience as you. Opus and Gemini have not come close to Codex for C++ work. I also run exclusively on xhigh. Its handling of complexity is unmatched.

At least until next week when Mythos and GPT 6 throw it all up in the air again.


Not my experience. GPT 5.4 walks all over Claude from what I've worked with, and it's Claude that is the one willing to just go do unnecessary stuff that was never asked for or implement the more hacky solutions to things without a care for maintainability/readability.

But I do not use extra high thinking unless it's for code review. I sit at GPT 5.4 high 95% of the time.


ChatGPT 5.4 with extra high reasoning has worked really well for me, and I don't notice a huge difference with Opus 4.6 with high reasoning (those are the 2 models/thinking modes I've used the most in the last month or so).

And as a bonus: GPT is slow. I’m doing a lot of RE (IDA Pro + MCP), and even when 5.4 gives slightly better guesses (rarely, but it happens), it takes 2x-4x longer. So it’s just easier to iterate again with Opus.

I've been messing with using Claude, Codex, and Kimi even for reverse engineering at https://decomp.dev/, and it's a ton of fun. Great because matching bytes is a scoring function that's easy for the models to understand and make progress on.
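
To make the "matching bytes as a scoring function" idea concrete, here is a minimal sketch. It assumes a naive position-by-position comparison of two byte strings; the function name `match_score` and the sample bytes are hypothetical, and real decompilation projects may well score matches differently (for example per function or per instruction).

```python
def match_score(target: bytes, candidate: bytes) -> float:
    """Fraction of byte positions where candidate equals target.

    Assumption: a plain positional comparison, with any length
    difference counted as mismatched bytes.
    """
    longest = max(len(target), len(candidate))
    if longest == 0:
        return 1.0  # two empty blobs match trivially
    same = sum(1 for a, b in zip(target, candidate) if a == b)
    return same / longest


# Hypothetical example: the compiled attempt differs in the last byte only.
target = bytes.fromhex("5589e583ec10")
attempt = bytes.fromhex("5589e583ec08")
print(f"{match_score(target, attempt):.1%}")  # 5 of 6 bytes match
```

A single number like this gives the model an unambiguous signal after every attempt, which is roughly why the task is easy to iterate on.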

I want to get into RE with AI. Which model are you liking the most?

Yeah, need some good RE benchmarks for the LLMs. :)

RE is a very interesting problem. A lot more than SWE can be RE'd. I've found the LLMs are reluctant to assist, though you can work around that.


What is RE in this context?

Reverse engineering

Mind sharing the use cases you're using IDA via MCP for?

Yes, it's becoming clear that OpenAI kinda sucks at alignment. GPT-5 can pass all the benchmarks but it just doesn't "feel good" like Claude or Gemini.

An alternative but similar formulation of that statement is that Anthropic has spent more training effort in getting the model to “feel good” rather than being correct on verifiable tasks. Which more or less tracks with my experience of using the model.

Alignment is a subspace of capability. Feeling good is nice, but it's also a manifestation of the level that the model can predict what I do and don't want it to do. The more accurately it can predict my intentions without me having to spell them out explicitly in the prompt, the more helpful it is.

GPT-5 is good at benchmarks, but benchmarks are more forgiving of a misaligned model. Many real world tasks often don't require strong reasoning abilities or high intelligence, so much as the ability to understand what the task is with a minimal prompt.

Not every shop assistant needs a physics degree, and not every physics professor is necessarily qualified to be a shop assistant. A person, or LLM, can be very smart while at the same time very bad at understanding people.

For example, if GPT-5 takes my code and rearranges something for no reason, that's not going to affect its benchmarks because the code will still produce the same answers. But now I have to spend more time reviewing its output to make sure it hasn't done that. The more time I have to spend post-processing its output, the lower its capabilities are since the measurement of capability on real world tasks is often the amount of time saved.


Whenever I come back to ChatGPT after using Claude or Gemini for an extended period, I’m really struck by the “AI-ness.” All the verbal tics and, truly, sloppishness, have been trained away by the other, more human-feeling models at this point.

GPT was clearly changed after its sycophantic models led to the lawsuits.

It still has a very ... plastic feeling. The way it writes feels cheap somehow. I don't know why, but Claude seems much more natural to me. I enjoy reading its writing a lot more.

That said, I'll often throw a prompt into both claude and chatgpt and read both answers. GPT is frequently smarter.


GPT is more accurate. But Claude has this way of association between things that seems smarter and more human to me.

This has been my experience. With very very rigid constraints it does ok, but without them it will optimize expediency and getting it done at the expense of integrating with the broader system.

My favorite example of this from last night:

Me: Let's figure out how to clone our company Wordpress theme in Hugo. Here're some tools you can use, here's a way to compare screenshots, iterate until 0% difference.

Codex: Okay Boss! I did the thing! I couldn't get the CSS to match so I just took PNGs of the original site and put them in place! Matches 100%!
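
For what it's worth, a "percent difference" check along those lines can be sketched as below. This is an assumption about the kind of comparison involved, not the actual tool from the prompt: images are modelled as nested lists of RGB tuples, and `diff_percent` is a hypothetical name.

```python
Pixel = tuple[int, int, int]
Image = list[list[Pixel]]


def diff_percent(a: Image, b: Image) -> float:
    """Percentage of pixels that differ between two same-sized images."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same dimensions")
    total = sum(len(row) for row in a)
    if total == 0:
        return 0.0
    # Count every pixel position whose RGB value is not identical.
    changed = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return 100.0 * changed / total


white, black = (255, 255, 255), (0, 0, 0)
original = [[white, white], [white, black]]
rebuilt = [[white, white], [black, black]]
print(diff_percent(original, rebuilt))  # one of four pixels differs
```

A metric like this only looks at pixels, so pasting screenshots of the original site scores 0% difference just as well as matching CSS would. The goal "iterate until 0% difference" does not say how to get there.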


TBH Claude Code is surprisingly shit to use given the technical resources and the amount of money behind it. Looking past the bugs and missing features, it's so obvious it's not built by people who care about the product from a developer/craftsman perspective. It's missing all the signs of polish/care, it feels like someone shipped an internal PoC to prod and kept hacking on it. And now they are just tacking on features to sell more buzzwords and internal prototypes. Classic user facing/commercial software story.

But we (the dev community) are kind of spoiled, because we have a lot of great developer tools that come from people passionate about their work, skilled at what they do and take pride in what they put out. I don't count myself among one of those people but I have benefited from their work throughout my career and have gotten used to it in my tooling.

All that being said Opus is hands down the best coding model for me (and I'm actively trying all of them) and I'll tolerate it as long as I can get it to do what I need, even with the warts and annoyances.


> TBH Claude Code is surprisingly shit to use given the technical resources and the amount of money behind it. Looking past the bugs and missing features, it's so obvious it's not built by people who care about the product from a developer/craftsman perspective. It's missing all the signs of polish/care, it feels like someone shipped an internal PoC to prod and kept hacking on it.

I don't wholly disagree, but personally it's still the tool I use and it's sort of fine. Perhaps not entirely for the money that's behind it, as you said, but it could be worse.

The CLI experience is pretty okay, although the auth is kinda weird (e.g. when trying to connect to AWS Bedrock). There's a permission system and sandboxing, plan mode and TODOs, decent sub-agent support, instruction files and custom skills, tool calls and LSP support and all the other stuff you'd expect. At least no weird bugs like I had with OpenCode, where trying to paste multi-line content inside of a Windows Terminal session led to the tool closing and every next line getting pasted in and executed in the terminal one by one. That was weird, though I will admit that using Windows feels messed up quite often nowadays even without stuff like that.

The desktop app gives you chat and cowork and code, although it almost feels like Cowork is really close to what Code does (and for some reason Cowork didn't seem to support non-OS drives?). Either way, the desktop app helps me not juggle terminal sessions and keeps a nice history in the sidebar, has a pretty plan display, easy ways of choosing permissions and worktrees, although I will admit that it can be sluggish and for some actions there just aren't progress indicators which feels oddly broken.

I wonder what they spend most of their time working on and why the basics aren't better, though to Anthropic's credit about a month ago the desktop Code section was borderline unusable on Windows when switching between two long conversations, which now seems to take a few seconds (which is still a few seconds too long, but at least usable).


> TBH Claude Code is surprisingly shit to use given the technical resources and the amount of money behind it.

What harness would you recommend instead?


pi, oh-my-pi, opencode - none of them have subsidized Claude though

Opencode can't lazy load skills, mcps, or agents and has limitations on context. It's a total nonstarter from my experience using it at work.

Isn't Anthropic basically a bunch of AI PhDs writing code? I'd imagine they had to be dragged kicking and screaming into actual software engineering.

The most obvious sign to me from the start that somebody wasn't really paying attention to how the Claude app(s) work is that on iOS, you have to leave the app active the entire time a response is streaming or it will error out.

I never saw that bug, I don't think, but there was one where it had to start the response before you switched away. That's thankfully been fixed for a few weeks.

That has never happened to me. Did you try that a long while ago? I only had that issue with Gemini

Don't worry, the Gemini website does the same, at least on Firefox mobile. You switch apps and you lose the response AND the prompt.

Normally some software devs should be fired for that.


Ticket is AI generated but from what I've seen these guys have a harness to capture/analyze CC performance, so effort was made on the user side for sure.

The note at the end of the post indicates the user asked Claude to review their own chat logs. It's impossible to tell if Claude used or built a performance harness or just wrote those numbers based on vibes.

The whole issue is very obviously LLM generated nonsense. The stats are way too specific and reinforce the user’s bias in typical hallucinated fashion.

There is this 3rd party tracker: https://marginlab.ai/trackers/claude-code/

Or it's just way easier to implement this way and they don't want to waste time on stuff only the HN crowd cares about?

Implementing Play Integrity is something developers have to go out of their way to do. Not implementing it requires literally zero effort. So no, it's not easier to do it this way.

One could say the same thing about virus scanners. They are obviously too little too late "security" so standards that require them have given up on real requirements like a way to achieve actual assurance of no buffer overflows. Nonetheless, an implementation to such a standard that chooses any off the shelf scanner is a lot less work than implementing a new scanner.

You expect someone to ship you an OS personalized to your taste and preferences?

This should clearly be editable in a GUI somewhere.

Lol if CC is the advantage that's the largest indictment of AI coding there is. Don't get me wrong, CC gives me good results, but I very much doubt their tooling is great. They just spew tokens at the model and the model is quite good at making sense of it and following through.

I suspect they have better RL setup for coding that makes their models better at coding than GPT/Gemini in practice.


Not tmux related at all, I've had it happen in all kinds of setups (alacritty/linux, vscode terminal macos).

I mean if you want glitchy garbage that works in the happy path mostly then game engine is the right foundation to build on. Software quality is the last thing game devs are known for. The whole industry is about building clever hacks to get something to look/feel a certain way, not building robust software that's correct to some spec.

Can confirm (used to work in the games industry). Code reviews and automatic testing of any kind are a rare sight.

In my experience games crash a lot less often than the windows file explorer

I feel like we give what’s some pretty impressive engineering short shrift because it’s just for entertainment


I'd posit that the average game dev is significantly more skilled than the average dev.

It's shocking how shitty the Claude Code CLI app is - config is brittle shit (setting up a plugin LSP is searching through GitHub issues and guessing which parameters you messed up), hooks render errors in the app when there are none, the permission harness is barely documented, and there are zero customization options (would you like the agent config to come from a different folder than source root? nope). Going through GitHub issues, the same issue you hit has been open since the beginning of 2025 and ignored - their issues are /dev/null - it's basically a user forum.

Your battery is going to suffer because of the extra ram as well.

I don't know your workloads, but for me personally 64 GB is the ceiling buffer on RAM - I can run an entire k8s cluster locally with that, and the M5 Pro with top cores is the same CPU as the M5 Max. I don't need the GPU - the local AI story and OSS models are just a toy for my use-cases and I'm always going to shell out for the API/frontier capabilities. I'm even thinking of the 48 GB config because they already have those on 8% discounts/shipped by Amazon and I never hit that even on my workstation with 64 GB.


> Your battery is going to suffer because of the extra ram as well.

No, it won't. The power drain of merely refreshing DRAM is negligible, it's no higher than the drain you'd see in S3 standby over the same time period.


Given the DRAM refresh is part of S3 standby, I'm afraid this is circular reasoning.

I suspect this is one of those "it depends" situations; does the 128gb vs 64gb sku have more chips or denser chips? If "more chips" probably it'll draw a tiny bit more power than the smaller version. If the "denser" chips, it may be "more power draw" but such a tiny difference that it's immaterial.

Similarly, having more cache may mean less SSD activity, which may mean less energy draw overall.

If I had a chip to put on the roulette table of this "what if" I'd put it on the "it won't make a difference in the real world in any meaningful way" square.


I thought my Z620 with 128GB of RAM was excessive! Actually, HP says they support up to 192GB of RAM, but for whatever reason the machine won't POST with more than 128GB (4Rx4) in it. Flawed motherboard?
