rafaelmn · 2026-04-07T18:56:26 1775588186

GPT is shit at writing code. It's not dumb - extra high thinking is really good at catching stuff - but it's like letting a smart junior into your codebase - ignore all the conventions, surrounding context, just slop all over the place to get it working. Claude is just a level above in terms of editing code.

sho_hn · 2026-04-07T19:22:56 1775589776

Very different experience for me. Codex 5.3+ on xhigh are the only models I've tried so far that write reasonably decent C++ (domains: desktop GUI, robotics, game engine dev, embedded stuff, general systems engineering-type codebases), and idiomatic code in languages not well-represented in training data, e.g. QML. One thing I explicitly like is that it knows better when to stop, instead of brute-forcing a solution by spamming bespoke helpers everywhere in a way no rational dev would write.

Not always, no, and it takes investment in good prompting/guardrails/plans/explicit test recipes for sure. I'm still on average better at programming in context than Codex 5.4, even if slower. But in terms of "task complexity I can entrust to a model and not be completely disappointed and annoyed", it scores the best so far. Saves a lot on review/iteration overhead.

It's annoying, too, because I don't much like OpenAI as a company.

(Background: 25 years of C++ etc.)

boring-human · 2026-04-07T21:09:54 1775596194

Same background as you, and same exact experience as you. Opus and Gemini have not come close to Codex for C++ work. I also run exclusively on xhigh. Its handling of complexity is unmatched.

At least until next week when Mythos and GPT 6 throw it all up in the air again.

Jcampuzano2 · 2026-04-07T19:03:12 1775588592

Not my experience. GPT 5.4 walks all over Claude from what I've worked with, and it's Claude that is the one willing to just go do unnecessary stuff that was never asked for or implement the more hacky solutions to things without a care for maintainability/readability.

But I do not use extra high thinking unless it's for code review. I sit at GPT 5.4 high 95% of the time.

camdenreslink · 2026-04-07T22:05:55 1775599555

ChatGPT 5.4 with extra high reasoning has worked really well for me, and I don't notice a huge difference with Opus 4.6 with high reasoning (those are the 2 models/thinking modes I've used the most in the last month or so).

leobuskin · 2026-04-07T19:02:27 1775588547

And as a bonus: GPT is slow. I’m doing a lot of RE (IDA Pro + MCP), even when 5.4 gives a little bit better guesses (rarely, but happens) - it takes x2-x4 longer. So, it’s just easier to reiterate with Opus

aizk · 2026-04-08T01:57:48 1775613468

I've been messing with using Claude, Codex, and Kimi even for reverse engineering at https://decomp.dev/. It's a ton of fun. Great because matching bytes is a scoring function that's easy for the models to understand and make progress on.
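
To make the "matching bytes as a scoring function" idea concrete, here is a minimal sketch. It is not how decomp.dev actually scores anything; `match_score` and the sample byte strings are hypothetical, and the only assumption is that a candidate build is compared position by position against the target bytes.

```python
def match_score(target: bytes, candidate: bytes) -> float:
    """Return the fraction of byte positions where candidate equals target.

    Hypothetical scoring rule for illustration: positions that exist in
    only one of the two inputs count as misses, so only an exact
    byte-for-byte copy scores 1.0.
    """
    longest = max(len(target), len(candidate))
    if longest == 0:
        return 1.0  # two empty blobs match trivially
    matching = sum(a == b for a, b in zip(target, candidate))
    return matching / longest


# Made-up bytes: the attempt differs from the target in the last byte only.
target = bytes.fromhex("5589e583ec10")
attempt = bytes.fromhex("5589e583ec08")
print(f"{match_score(target, attempt):.2%}")  # 5 of 6 bytes match
```

A single number like this gives the model an unambiguous signal after every attempt, which is presumably why it is easy to make progress against.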

adamgoodapp · 2026-04-08T07:34:04 1775633644

I want to get into RE with AI. Which model are you liking the most?

blazespin · 2026-04-07T20:48:53 1775594933

Yeah, need some good RE benchmarks for the LLMs. :)

RE is a very interesting problem. A lot more than SWE can be RE'd. I've found the LLMs are reluctant to assist, though you can work around that.

porker · 2026-04-07T21:02:50 1775595770

What is RE in this context?

astrange · 2026-04-07T21:16:48 1775596608

Reverse engineering

19h · 2026-04-07T23:53:49 1775606029

Mind sharing the use cases you're using IDA via MCP for?

zarzavat · 2026-04-07T19:05:03 1775588703

Yes, it's becoming clear that OpenAI kinda sucks at alignment. GPT-5 can pass all the benchmarks but it just doesn't "feel good" like Claude or Gemini.

chaos_emergent · 2026-04-07T19:32:49 1775590369

An alternative but similar formulation of that statement is that Anthropic has spent more training effort in getting the model to “feel good” rather than being correct on verifiable tasks. Which more or less tracks with my experience of using the model.

zarzavat · 2026-04-08T11:50:07 1775649007

Alignment is a subspace of capability. Feeling good is nice, but it's also a manifestation of the level that the model can predict what I do and don't want it to do. The more accurately it can predict my intentions without me having to spell them out explicitly in the prompt, the more helpful it is.

GPT-5 is good at benchmarks, but benchmarks are more forgiving of a misaligned model. Many real world tasks often don't require strong reasoning abilities or high intelligence, so much as the ability to understand what the task is with a minimal prompt.

Not every shop assistant needs a physics degree, and not every physics professor is necessarily qualified to be a shop assistant. A person, or LLM, can be very smart while at the same time very bad at understanding people.

For example, if GPT-5 takes my code and rearranges something for no reason, that's not going to affect its benchmarks because the code will still produce the same answers. But now I have to spend more time reviewing its output to make sure it hasn't done that. The more time I have to spend post-processing its output, the lower its capabilities are since the measurement of capability on real world tasks is often the amount of time saved.

lilytweed · 2026-04-07T19:17:34 1775589454

Whenever I come back to ChatGPT after using Claude or Gemini for an extended period, I’m really struck by the “AI-ness.” All the verbal tics and, truly, sloppishness, have been trained away by the other, more human-feeling models at this point.

kranke155 · 2026-04-07T21:59:00 1775599140

GPT was clearly changed after its sycophantic models led to the lawsuits.

josephg · 2026-04-07T23:10:10 1775603410

It still has a very ... plastic feeling. The way it writes feels cheap somehow. I don't know why, but Claude seems much more natural to me. I enjoy reading its writing a lot more.

That said, I'll often throw a prompt into both claude and chatgpt and read both answers. GPT is frequently smarter.

kranke155 · 2026-04-08T10:14:34 1775643274

GPT is more accurate. But Claude has this way of association between things that seems smarter and more human to me.

whalesalad · 2026-04-07T18:59:57 1775588397

This has been my experience. With very very rigid constraints it does ok, but without them it will optimize for expediency and getting it done at the expense of integrating with the broader system.

ctoth · 2026-04-07T19:35:51 1775590551

My favorite example of this from last night:

Me: Let's figure out how to clone our company Wordpress theme in Hugo. Here're some tools you can use, here's a way to compare screenshots, iterate until 0% difference.

Codex: Okay Boss! I did the thing! I couldn't get the CSS to match so I just took PNGs of the original site and put them in place! Matches 100%!
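
A minimal sketch of the kind of "iterate until 0% difference" check described above, assuming both screenshots have already been decoded into equally sized grids of RGB tuples. The name `percent_different` and the tiny sample images are hypothetical; the tooling actually used in the anecdote is not shown in the thread.

```python
Pixel = tuple[int, int, int]


def percent_different(a: list[list[Pixel]], b: list[list[Pixel]]) -> float:
    """Return the percentage of pixels that differ between two images.

    Both images are lists of rows of (R, G, B) tuples and must have
    identical dimensions.
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same dimensions")
    total = sum(len(row) for row in a)
    if total == 0:
        return 0.0
    differing = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return 100.0 * differing / total


# Made-up 2x2 images: one of the four pixels differs.
original = [[(255, 255, 255), (0, 0, 0)], [(10, 20, 30), (10, 20, 30)]]
rebuilt = [[(255, 255, 255), (0, 0, 0)], [(10, 20, 30), (99, 20, 30)]]
print(percent_different(original, rebuilt))
```

A check like this only looks at rendered pixels, so a page that simply embeds PNGs of the original also scores 0.0. Closing that loophole would need an extra constraint the metric itself does not express.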

rafaelmn · 2026-04-06T20:09:00 1775506140

TBH Claude Code is surprisingly shit to use given the technical resources and the amount of money behind it. Looking past the bugs and missing features, it's so obvious it's not built by people who care about the product from a developer/craftsman perspective. It's missing all the signs of polish/care, it feels like someone shipped an internal PoC to prod and kept hacking on it. And now they are just tacking on features to sell more buzzwords and internal prototypes. Classic user facing/commercial software story.

But we (the dev community) are kind of spoiled, because we have a lot of great developer tools that come from people who are passionate about their work, skilled at what they do, and take pride in what they put out. I don't count myself among those people but I have benefited from their work throughout my career and have gotten used to it in my tooling.

All that being said Opus is hands down the best coding model for me (and I'm actively trying all of them) and I'll tolerate it as long as I can get it to do what I need, even with the warts and annoyances.

KronisLV · 2026-04-06T21:02:07 1775509327

> TBH Claude Code is surprisingly shit to use given the technical resources and the amount of money behind it. Looking past the bugs and missing features, it's so obvious it's not built by people who care about the product from a developer/craftsman perspective. It's missing all the signs of polish/care, it feels like someone shipped an internal PoC to prod and kept hacking on it.

I don't wholly disagree, but personally it's still the tool I use and it's sort of fine. Perhaps not entirely for the money that's behind it, as you said, but it could be worse.

The CLI experience is pretty okay, although the auth is kinda weird (e.g. when trying to connect to AWS Bedrock). There's a permission system and sandboxing, plan mode and TODOs, decent sub-agent support, instruction files and custom skills, tool calls and LSP support and all the other stuff you'd expect. At least no weird bugs like I had with OpenCode, where trying to paste multi-line content inside of a Windows Terminal session led to the tool closing and every next line getting pasted in and executed in the terminal one by one. That was weird, though I will admit that using Windows feels messed up quite often nowadays even without stuff like that.

The desktop app gives you chat and cowork and code, although it almost feels like Cowork is really close to what Code does (and for some reason Cowork didn't seem to support non-OS drives?). Either way, the desktop app helps me not juggle terminal sessions and keeps a nice history in the sidebar, has a pretty plan display, easy ways of choosing permissions and worktrees, although I will admit that it can be sluggish and for some actions there just aren't progress indicators which feels oddly broken.

I wonder what they spend most of their time working on and why the basics aren't better, though to Anthropic's credit about a month ago the desktop Code section was borderline unusable on Windows when switching between two long conversations, which now seems to take a few seconds (which is still a few seconds too long, but at least usable).

joenot443 · 2026-04-06T20:33:29 1775507609

> TBH Claude Code is surprisingly shit to use given the technical resources and the amount of money behind it.

What harness would you recommend instead?

mm263 · 2026-04-06T20:34:51 1775507691

pi, oh-my-pi, opencode - none of them have subsidized Claude though

ch4s3 · 2026-04-06T20:38:36 1775507916

Opencode can't lazy load skills, mcps, or agents and has limitations on context. It's a total nonstarter from my experience using it at work.

oblio · 2026-04-06T20:50:46 1775508646

Isn't Anthropic basically a bunch of AI PhDs writing code? I'd imagine they had to be dragged kicking and screaming into actual software engineering.

crooked-v · 2026-04-06T20:32:23 1775507543

The most obvious sign to me from the start that somebody wasn't really paying attention to how the Claude app(s) work is that on iOS, you have to leave the app active the entire time a response is streaming or it will error out.

glhaynes · 2026-04-06T20:51:55 1775508715

I never saw that bug, I don't think, but there was one where it had to start the response before you switched away. That's thankfully been fixed for a few weeks.

johanyc · 2026-04-06T20:55:24 1775508924

That has never happened to me. Did you try that a long while ago? I only had that issue with Gemini

oblio · 2026-04-06T20:52:22 1775508742

Don't worry, the Gemini website does the same, at least on Firefox mobile. You switch apps and you lose the response AND the prompt.

Normally some software devs should be fired for that.

rafaelmn · 2026-04-06T19:14:53 1775502893

Ticket is AI generated but from what I've seen these guys have a harness to capture/analyze CC performance, so effort was made on the user side for sure.

notatallshaw · 2026-04-06T20:02:51 1775505771

The note at the end of the post indicates the user asked Claude to review their own chat logs. It's impossible to tell if Claude used or built a performance harness or just wrote those numbers based on vibes.

dreadnip · 2026-04-07T06:43:57 1775544237

The whole issue is very obviously LLM generated nonsense. The stats are way too specific and reinforce the user's bias in typical hallucinated fashion.

gardnr · 2026-04-06T20:38:07 1775507887

There is this 3rd party tracker: https://marginlab.ai/trackers/claude-code/

rafaelmn · 2026-04-05T10:47:48 1775386068

Or it's just way easier to implement this way and they don't want to waste time on stuff only the HN crowd cares about?

bakugo · 2026-04-05T11:13:25 1775387605

Implementing Play Integrity is something developers have to go out of their way to do. Not implementing it requires literally zero effort. So no, it's not easier to do it this way.

kackerlacker · 2026-04-05T13:53:59 1775397239

One could say the same thing about virus scanners. They are obviously too little too late "security" so standards that require them have given up on real requirements like a way to achieve actual assurance of no buffer overflows. Nonetheless, an implementation to such a standard that chooses any off the shelf scanner is a lot less work than implementing a new scanner.

rafaelmn · 2026-04-03T00:38:46 1775176726

You expect someone to ship you an OS personalized to your taste and preferences?

jazzypants · 2026-04-03T01:00:31 1775178031

This should clearly be editable in a GUI somewhere.

rafaelmn · 2026-04-01T14:19:46 1775053186

Lol, if CC is the advantage, that's the largest indictment of AI coding there is. Don't get me wrong, CC gives me good results, but I very much doubt their tooling is great; they just spew tokens at the model and the model is quite good at making sense of it and following through.

I suspect they have better RL setup for coding that makes their models better at coding than GPT/Gemini in practice.

rafaelmn · 2026-03-31T18:15:31 1774980931

Not tmux related at all; I've had it happen in all kinds of setups (alacritty/linux, vscode terminal macos).

rafaelmn · 2026-03-31T18:12:24 1774980744

I mean if you want glitchy garbage that works in the happy path mostly then game engine is the right foundation to build on. Software quality is the last thing game devs are known for. The whole industry is about building clever hacks to get something to look/feel a certain way, not building robust software that's correct to some spec.

FartyMcFarter · 2026-03-31T18:23:24 1774981404

Can confirm (used to work in the games industry). Code reviews and automatic testing of any kind are a rare sight.

spencerflem · 2026-03-31T18:30:24 1774981824

In my experience games crash a lot less often than the windows file explorer

I feel like we give what’s some pretty impressive engineering short shrift because it’s just for entertainment

theLiminator · 2026-03-31T19:22:36 1774984956

I'd posit that the average game dev is significantly more skilled than the average dev.

rafaelmn · 2026-03-27T20:43:10 1774644190

It's shocking how shitty the Claude Code CLI app is - config is brittle shit (setting up a plugin LSP means searching through GitHub issues and guessing which parameters you messed up), hooks render errors in the app when there are none, the permission harness is barely documented, and there are zero customization options (would you like the agent config to come from a different folder than the source root? nope). Going through GitHub issues, the same issue you hit has been open since the beginning of 2025 and ignored - their issues are /dev/null - it's basically a user forum.

rafaelmn · 2026-03-27T11:40:23 1774611623

Your battery is going to suffer because of the extra ram as well.

I don't know your workloads, but for me personally 64 GB is the ceiling buffer on RAM - I can run an entire k8s cluster locally with that, and the M5 Pro with top cores is the same CPU as the M5 Max. I don't need the GPU - the local AI story and OSS models are just a toy for my use-cases and I'm always going to shell out for the API/frontier capabilities. I'm even thinking of the 48 GB config because they already have those on 8% discounts/shipped by Amazon and I never hit that even on my workstation with 64 GB.

zozbot234 · 2026-03-27T11:53:34 1774612414

> Your battery is going to suffer because of the extra ram as well.

No, it won't. The power drain of merely refreshing DRAM is negligible, it's no higher than the drain you'd see in S3 standby over the same time period.

3form · 2026-03-27T12:30:40 1774614640

Given the DRAM refresh is part of S3 standby, I'm afraid this is circular reasoning.

cduzz · 2026-03-27T13:25:57 1774617957

I suspect this is one of those "it depends" situations; does the 128gb vs 64gb sku have more chips or denser chips? If "more chips" probably it'll draw a tiny bit more power than the smaller version. If the "denser" chips, it may be "more power draw" but such a tiny difference that it's immaterial.

Similarly, having more cache may mean less SSD activity, which may mean less energy draw overall.

If I had a chip to put on the roulette table of this "what if" I'd put it on the "it won't make a difference in the real world in any meaningful way" square.

ryandrake · 2026-03-27T15:22:13 1774624933

I thought my Z620 with 128GB of RAM was excessive! Actually, HP says they support up to 192GB of RAM, but for whatever reason the machine won't POST with more than 128GB (4Rx4) in it. Flawed motherboard?