I think Opus does, in fact, find the bugs the same way GPT xhigh (or even high) does. It just discards them before presenting them to the user.
Opus is designed to be a lazy, corner-cutting model. Reviews are just one place where this shows. In my orchestration loop, Opus discards many findings from GPT 5.4 xhigh, justifying this as pragmatism. Opus YAGNIs everything; GPT wants you to consider seismic events in your todo list app. Sadly, there's nothing in between.
> A trivial example is that you can improve performance in a very simple way: ask "are you sure?" showing the model what it intends to do, BEFORE doing it. Improves performance by 10%
Put it into the "are you sure" loop and you'll see the model just keep oscillating for eternity. If you ask the model to question its output, it will take that as an instruction that it must revise, even if the output is correct.
Not in my experience. I mean, it happens. But models can check if their own function calls are reasonable. And that doesn't require dropping the context cache, so it's a lot less expensive than you would probably initially think.
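To make the shape of this concrete: here's a minimal sketch of an "are you sure?" pass over a proposed tool call, with a round cap so it can't oscillate forever. The reviewer here is a stub standing in for a model call; `verify_call`, `stub_review`, and the review contract are all hypothetical names, not any SDK's API.

```python
# Sketch of a capped "are you sure?" verification loop.
# review_fn(call) -> (approved, revised_call) is an assumed contract;
# in practice it would be a model call, stubbed out here so this runs.

def verify_call(call, review_fn, max_rounds=2):
    """Ask the reviewer to confirm a proposed tool call before executing it."""
    for _ in range(max_rounds):
        approved, revised = review_fn(call)
        if approved:
            return call      # reviewer signed off; stop here
        call = revised       # accept one revision, then re-check
    return call              # cap reached: proceed rather than oscillate

# Stub reviewer: rejects a destructive flag once, then approves.
def stub_review(call):
    if "--force" in call:
        return False, call.replace(" --force", "")
    return True, call

print(verify_call("rm build --force", stub_review))  # rm build
```

The cap is the point: one or two review rounds captures most of the benefit, and a hard limit prevents the eternal second-guessing described above.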
I've been using the OpenAI Agents SDK for a while now and am largely happy with the abstractions - handoffs/sub-agents, tools, guardrails, structured output, etc. Building the infra and observability, and having it scale reliably, was a bigger pain for me. So I do get Anthropic's move into managed agents.
Agreed, but it's a bit nuanced. I'm working on a fairly complex project now in a domain where I have no technical experience. The first iteration of the project was complete garbage, but it was garbage mainly because I asked for things to be done and never asked HOW it should be done. Result? Complete, utter garbage. It kinda, sorta worked, but again, I would never use it in anything important.
Then we went through ~10 complete rewrites based on the learnings from previous attempts. As we went through these iterations, I became much more knowledgeable about the domain, because I saw the failure points, read the resulting code, and asked the right questions.
Without AI, I would likely have given up after iteration 2, and certainly would not have done 10 iterations.
So the nuance here is that iterating and throwing away the entire thing is going to become much cheaper, but not without an engineer being in the loop, asking the right questions.
Note: each iteration went through dual reviews of codex and opus at each phase with every finding fixed and review saying everything is perfect, the best thing on earth.
I'm seeing a similar process, but large teams are still finding this output to be unmaintainable.
The problem is that vanishingly few people actually understand the code and are asking the agents to do all of the interpretation and reasoning for them.
This code that you've built is only maintainable for as long as you are still around at the company to work on it -- it's essentially a codebase that you're the only domain expert in. That's not a good outcome for companies either.
My prediction is that the companies that learn this lesson are the ones that are going to stick around. LLMs won't be in wide use for features but for throwaway busy-work type problems that eat lots of human resources and can't be ignored.
I left my last company job just before "AI-first engineering" became mainstream, and you confirmed what I was feeling all this time - I have absolutely zero idea how teams actually manage to collaborate on LLM-managed projects. All the projects I'm working on now are my own, and the only reason I could do this is that I had unlimited time and unlimited freedom. There's no chance I would be able to do this in a team setting.
I'm positive that the last company's CEO by now mandates that nobody write a single line of code by hand, and there's likely some rigid process everyone has to follow.
I agree and commiserate. In the near term my picture is pretty grim. There are fantastic uses for these tools, but they're being abused.
I was big on correctness, software safety (think medical devices, not memory) and formal proofs anyway, so I think I'm just going to take the pay cut and start selecting for those types of jobs. Your run of the mill SaaS or open source+commercial companies are all becoming a death march.
The day I start freaking out about my job is the day when my non-engineer friend turned vibe coder understands how, or why the thing that AI wrote works. Or why something doesn't work exactly the way he envisioned and what does it take to get it there.
If it can replace SWEs, then there's no reason why it can't replace say, a lawyer, or any other job for that matter. If it can't, then SWE is fine. If it can - well, we're all fucked either way.
> If it can replace SWEs, then there's no reason why it can't replace say, a lawyer
SWE is unique in that for part of the job it's possible to set up automated verification for correct output - so you can train a model to be better at it. I don't think that exists in law or even most other work.
What is the automated verification of correct output and who defines that?
But before verification, what IS correct output?
I understand the SWE process is unique in that there are some automations that verify some inputs and outputs, but this reasoning falls into the same fallacies we had before the AI era. The first one that comes to mind is that 100% test coverage means the software is perfect.
Right, and that's why it's only part of the job. The benchmarks they're currently running consist of the AI being handed a detailed spec plus tests to make pass, which isn't really what developing a feature looks like.
Going from fuzzy under-defined spec to something well defined isn't solved.
Going from well defined spec to verification criteria also isn't.
Once those are in place though, we get https://vinext.io - which from what I understand they largely vibe-coded by using NextJS's test suite.
> First one that comes to mind is that 100% code coverage in tests means that software is perfect
I agree... but I'm also not sure software needs to be perfect.
I chuckle every time <insert any LLM company here> says something along the lines of "the model is so good that we won't release it to the general public, ahem, because safety."
Because the exact same thing has been said about every single upcoming model since GPT 3.5.
At this point, it must be an inside joke they keep running just because.
This is how Anthropic markets their AI releases, and the reality is that they are terrified of local AI models competing with them.
Almost everyone in this thread is falling for the trick they're pulling and not asking why the benchmarks and research behind each new model are never independently verified but always internal to the company.
So it is just marketing wrapped around creating fear to get local AI models banned.
Yep, this is exactly it. Open-source models, especially ones that run locally, are catching up, and that's literally an existential threat to these companies. Local models are now quite useful (Qwen, Gemma), and open-weight models running on cheaper clouds are perfectly sufficient for responsible software engineers building software. You can take your pick of Kimi 2.5, GLM 5.1, and the soon-to-be-released Deepseek 4, which might end up above Opus level at a fifth of the cost.

Anthropic is particularly vulnerable here, since their entire market share rests on the developer market. There's a reason Google, for example, is not so concerned and is perfectly happy releasing open models that cut into its own market share; to a lesser extent, the same goes for OpenAI. Anthropic has bet the house on software development, which is why we see increasing desperation both to lobby for regulation of open/local models and to wall off their coding harness and subscription plans.
The only people who are "cooked" are those who rely on SOTA models to function in their jobs, and companies who are desperate to regulate open / local models to maintain their marketshare.
Common sense and interviewing around did. No one wants to hire someone who is not AI-native anymore, unless you're looking at positions that pay peanuts.
That's pretty funny actually, but I'm not the one going around telling everyone they're cooked if they don't adopt my expensive workflow, which disenfranchises me from my work and makes me more replaceable.
That probably matters for some scenarios, but I have yet to find one where thinking tokens didn't hint at the root cause of the failure.
All of my unsupervised worker agents have sidecars that inject messages when thinking tokens match certain heuristics. For example, any time Opus says "pragmatic", it's an instant Esc Esc > "Pragmatic fix is always wrong, do the Correct fix"; likewise whenever "pre-existing issue" appears (it's never pre-existing).
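The sidecar idea can be sketched in a few lines: scan a chunk of thinking tokens against trigger patterns and return the steering messages to inject. The trigger phrases come from the comment above; the function name and the injection mechanism are hypothetical, not part of any agent framework.

```python
# Minimal sketch of a thinking-token heuristic sidecar (names are
# illustrative). Patterns map trigger phrases to steering messages.
import re

HEURISTICS = {
    r"\bpragmatic\b": "Pragmatic fix is always wrong, do the Correct fix.",
    r"\bpre-existing issue\b": "It's not pre-existing. Investigate and fix it.",
}

def sidecar_check(thinking_text):
    """Return the steering messages to inject for a chunk of thinking tokens."""
    return [msg for pat, msg in HEURISTICS.items()
            if re.search(pat, thinking_text, re.IGNORECASE)]

print(sidecar_check("This is a pragmatic compromise given the deadline."))
# ['Pragmatic fix is always wrong, do the Correct fix.']
```

In a real loop, the returned messages would be injected into the agent's context (the Esc Esc step above) before the model commits to the shortcut.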
> also whenever "pre-existing issue" appears (it's never pre-existing)
I dunno... There were some pre-existing issues in my projects. Claude ran into them and correctly classified them as pre-existing. It's definitely a problem if Claude breaks tests and then claims the issue was pre-existing, but is that really what's happening?
> For example, any time Opus says "pragmatic", it's an instant Esc Esc > "Pragmatic fix is always wrong, do the Correct fix"; likewise whenever "pre-existing issue" appears (it's never pre-existing).
It's so weird to see language changes like this: Outside of LLM conversations, a pragmatic fix and a correct fix are orthogonal. IOW, fix $FOO can be both.
From what you say, your experience has been that a pragmatic fix is on the same axis as a correct fix; it's just a negative on that axis.
It's contextual though, and pragmatic seems different to me than correct.
For example, if you have $20 and a leaking roof, a $20 bucket of tar may be the pragmatic fix. Temporary but doable.
Some might say it is not the correct way to fix that roof. At least, I can see some making that argument. The pragmatism comes from "what can be done" vs "should be".
From my perspective, it seems like viable usage. And I guess one wonders what the LLM means when using it that way. What makes it determine that a compromise is required?
(To be pragmatic, shouldn't one consider that synonyms aren't identical, but instead close to the definition?)
> It's contextual though, and pragmatic seems different to me than correct.
To me too, that's why I say they are measurements on different dimensions.
To my mind, I can draw an X/Y axis with "Pragmatic" on the Y and "Correctness" on the X, and any point on that chart would have an {X,Y} value, which is {Pragmatic, Correctness}.
If I'm reading the original comment correctly, the poster's experience of CC is that it's not an X/Y plot; it's a single line, with "Pragmatic" on the extreme left and "Correctness" on the extreme right.
Basically, any movement towards pragmatism is a movement away from correctness, while in my model it is possible to move towards Pragmatic while keeping Correctness the same.
I don't think it's a single axis even in the original poster's conception, since you could be both incorrect and also not pragmatic.
But if a fix needs to be described as pragmatic relative to the alternatives, that's probably because it couldn't be described as correct. Otherwise you wouldn't be talking about how pragmatic it is.
I had an interesting experience to the opposite effect last night. One of my tests had been failing for a long time, something to do with dbus interacting with Qt and segfaulting pytest. I'd been ignoring it forever and finally asked Claude Code to just remove the problematic test. I came back a few minutes later to find Claude burning tokens repeatedly trying and failing to fix it: "Actually, on second thought, it would be better to fix this test."
Match my vibes, claude. The application doesn't crash, so just delete that test!
The problem with specs for me is always with boundaries. How many specs do you have for a complex project? How do they reference each other? What happens when requirements cross boundaries?
Gemini CLI is horrible though.