I caught myself saying “you’re absolutely right” to my wife last night, unironically. This was 100% not in my vocabulary six months ago.
If I spend 40 hours a week talking to anybody, some of their language or mannerisms are going to rub off on me. I can’t think of a compelling reason why a human-sounding chat bot would be any different.
Almost two decades ago I watched all of Farscape in under two weeks during a college winter break. I often still reflexively say "frell" instead of "fuck".
Another one I noticed is "or maybe I hallucinated that" instead of "or maybe I dreamed that". Researchers will be horrified to learn that even talk about LLMs affects people's vocabulary.
I’ve been driving Claude as my primary coding interface the last three months at my job. Other than a different domain, I feel like I could have written this exact article.
The project I’m on started as a vibe-coded prototype that quickly got promoted to a production service we sell.
I’ve had to build the mental model after the fact, while refactoring and ripping out large chunks of nonsense or dead code.
But the product wouldn’t exist without that quick and dirty prototype, and I can use Claude as a goddamned chainsaw to clean up.
On Friday, I finally added a type checker pre-commit hook and fixed the 90 existing errors (properly, no type ignores) in ~2 hours. I tried a fully agentic approach first and it failed miserably; then I went through the errors one by one with Claude, we tightened up some existing types, fixed some clunky abstractions, and got a nice, clean result.
AI-assisted coding is amazing, but IMO for production code there’s no substitute for human review and guidance.
My process: start ideating and get the AI to poke holes in your reasoning, your vision, scalability, etc. Do this for a few days while taking breaks. Keep it all contained in one Markdown file with Mermaid diagrams and sections.
Then use that ideation to architect. Dive into details and tell the AI exactly what your choices are: how certain methods should be called, how logging and observability should be set up, what language to use, type checking, coding style (configure ruthless linting and formatting before you write a single line of code), what testing methodology and framework (unit, integration, e2e), which database, and that you will handle migrations yourself. Pin down as much as possible so the AI is confined to how you would do it.
Then create a plan file, have it manage it like a task list, and implement in parts. Before starting, it needs to present you a plan; in it you will notice it makes mistakes, misunderstands things you maybe didn't clarify before, or simply forgets. You add to AGENTS.md or whatever, make changes to the AI's plan, tell it to update the plan.md, and when satisfied, proceed.
After it's done, review the code. You will notice there is always something to fix: hardcoded variables, a SQL migration with seed data that shouldn't actually be a migration, just generally crazy stuff.
The worst part is that the AI is always very loose on requirements. You will notice all its fields are nullable and records have little to no validation; you report an error during testing and it tries to solve it with a brittle async solution, like LISTEN/NOTIFY or a callback, instead of the architecturally correct fix. These are things that are hell to debug at scale, especially if you didn't write the code.
If you do this and iterate you will gradually end up with a solid harness and you will need to review less.
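The nullable-fields-everywhere pattern is worth countering with explicit parse-at-the-boundary validation. A minimal TypeScript sketch of the contrast (the `User` record and `parseUser` helper here are hypothetical, not from any real schema library):

```typescript
// Hypothetical "loose" record an AI tends to generate: everything nullable.
interface LooseUser {
  id: string | null;
  email: string | null;
  createdAt: string | null;
}

// The stricter shape we actually want downstream code to rely on.
interface User {
  id: string;
  email: string;
  createdAt: Date;
}

// Parse-at-the-boundary helper: reject bad input early instead of
// letting nulls propagate into the core of the app.
function parseUser(raw: unknown): User {
  const r = raw as Record<string, unknown>;
  if (typeof r?.id !== "string" || r.id.length === 0) {
    throw new Error("id is required");
  }
  if (typeof r?.email !== "string" || !r.email.includes("@")) {
    throw new Error("a valid email is required");
  }
  if (typeof r?.createdAt !== "string" || Number.isNaN(Date.parse(r.createdAt))) {
    throw new Error("createdAt must be a parseable date string");
  }
  return { id: r.id, email: r.email, createdAt: new Date(r.createdAt) };
}
```

The point isn't this particular helper; it's that the strict version makes the AI's loose assumptions fail loudly at the edge of the system instead of silently at scale.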
> After it's done, review the code. You will notice there is always something to fix: hardcoded variables, a SQL migration with seed data that shouldn't actually be a migration, just generally crazy stuff.
>
> The worst part is that the AI is always very loose on requirements. You will notice all its fields are nullable and records have little to no validation; you report an error during testing and it tries to solve it with a brittle async solution, like LISTEN/NOTIFY or a callback, instead of the architecturally correct fix. These are things that are hell to debug at scale, especially if you didn't write the code.
For that I usually get it reviewed by LLMs first, before reviewing it myself.
Same model but a clean session, plus different models from different providers. And multiple (at least 2) automated rounds: review -> triage by the implementing session -> addressing the feedback (with reasons for anything deferred or ignored) -> review -> triage by the implementing session -> …
Works wonders.
Committing the initial spec / plan also helps the reviewers compare the actual implementation to what was planned. Didn’t expect it, but it’s worked nicely.
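A rough TypeScript sketch of that loop's shape, with stub functions standing in for the actual LLM calls (every name here is hypothetical):

```typescript
// Each piece of reviewer feedback, plus how the implementing session
// triaged it (addressed, or deferred/ignored with a stated reason).
type Feedback = {
  reviewer: string;
  comment: string;
  addressed: boolean;
  reason?: string;
};

// Stub: a real implementation would send the diff to a clean session
// of the named model and parse its review comments.
function requestReview(reviewer: string, diff: string): Feedback[] {
  return [{ reviewer, comment: `review of ${diff}`, addressed: false }];
}

// Stub: the implementing session addresses each item or records why
// it is deferring/ignoring it, so later rounds can check the reasons.
function triage(items: Feedback[]): Feedback[] {
  return items.map((f) => ({ ...f, addressed: true, reason: "fixed" }));
}

// review -> triage -> review -> triage, across several reviewers.
function reviewRounds(diff: string, reviewers: string[], rounds = 2): Feedback[] {
  const log: Feedback[] = [];
  for (let round = 0; round < rounds; round++) {
    for (const reviewer of reviewers) {
      log.push(...triage(requestReview(reviewer, diff)));
    }
  }
  return log;
}
```

The useful property is the audit trail: each round sees the previous round's triage reasons, so deferred feedback can't silently vanish.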
I agree! It should be very stable, IMO. If not, then please send a bug report and we'll look into it. Also, now it scales well with the number of listening connections (given clients listen on unique channel names): https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit...
The LISTEN/NOTIFY feature really just doesn’t get enough PR. It is perfectly suitable for production workloads yet people still want to reach for more complicated solutions they don’t need.
I find it very interesting that you assume this method would carry over to other projects. I find it even more interesting that you assume all software codebases use a database, care about anything async, or that these ideas percolate out to general software engineering.
Sounds like a solid way to make crud web apps though.
GP is clearly providing examples of categories of tasks. Sure, not all languages do “async fn foo()”, but almost all problem domains involve some sort of making sure the right things happen at the right times, which is in a similar ballpark.
Holier than thou “yeah well I work on stuff that doesn’t use databases, checkmate!” doesn’t really land - data still gets moved around somehow, and often over a network!
I've noticed this too, though not so much with type checkers as with linters. And I can't really figure out if there's even a way to solve it.
If you set up restrictive linters and don't explicitly prohibit agents from adding inline allows, most LOC will be allow comments.
Based on this, I decided to prohibit any inline allows. And then the agents started doing very questionable things to satisfy clippy.
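For the linter side, one enforcement lever (assuming ESLint; Rust's `forbid` lint level plays a similar role, since `forbid` can't be overridden by an inline `#[allow]`) is that flat config can ban inline disables outright. A sketch of a hypothetical `eslint.config.ts`:

```typescript
// eslint.config.ts (sketch): make inline escape hatches a hard error.
export default [
  {
    linterOptions: {
      // Treat any inline /* eslint-disable */ comment as a violation.
      noInlineConfig: true,
      // Flag disable directives that no longer suppress anything.
      reportUnusedDisableDirectives: "error",
    },
  },
];
```

With this in place, the agent can't paper over a lint error in the file; it has to either fix the code or surface the conflict to you.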
Recent example:
- Claude set up a test support module so that it could reuse things. Since it wasn't used in all tests, Rust complained about dead_code. Instead of making it work, Claude decided to remove the test support module and just... blow up each test.
If you enable thinking summaries, you'll always see the agent saying something like "I need to be pragmatic", which is the right choice 50% of the time.
I can't agree here. https://pelorus-nav.com/ (one of my side projects) is 95-98% written by Claude Opus 4.6, all in very nice typescript which I carefully review and correct, and use good prompting and context hygiene to ensure it doesn't take shortcuts. It's taken a month or so but so worth it. And my packing list app packzen.org is also pretty decent typescript all through.
So you do agree? If you are having to review and correct then it's not really the LLM writing it anymore. I have little doubt that you can write good Typescript, but that's not what I said. I said LLMs cannot write good Typescript and it seems you agree given your purported actions towards it. Which is quite unlike some other languages where LLMs write good code all the time — no hand holding necessary.
I find correction is rarely necessary with Opus 4.6. Definitely not so much that "it's not really the LLM writing it anymore." More like it's the author and I'm the editor (in this limited case -- of course architecturally the ideas are all mine.)
But I totally respect that my prompt style, the type of app I'm writing, and other factors could be influencing my success vs. others' lack of success.
> of course architecturally the ideas are all mine.
What else would you need to correct? I've never had trouble with LLMs generating basic syntax in any language. Architecture is exactly the aspect of language where LLMs seem to like to go to crazytown when in Typescript. It seems you've noticed too if the ideas in that area have had to come all from you.
I think it can write working TypeScript code, and it can write good TypeScript code if it is guided by a knowledgable programmer. It requires actually reviewing all the code and giving pointed feedback though (which at that point is only slightly more efficient than just writing it yourself).
> It requires actually reviewing all the code and giving pointed feedback though
Exactly. You can write good Typescript, no doubt, but LLMs cannot. This is not like some other languages where LLM generated code is actually consistently good without needing to become the author.
I've found it's less about specificity and more about reducing the number of critical assumptions it needs to make. Being too specific can be a hindrance in its own right.
And that's also a decent barometer for what it's good at: the more critical assumptions the AI needs to make, the less likely it is to make good ones.
For instance, when building a heat map, I don't have to get specific at all because the number of consequential assumptions it needs to make is small. I don't care about the colors or the label placement, or I can easily change them.
I caught it using Parameters<typeof otherfn>[2] the other day. It wanted to avoid importing a type, so it did this nonsense. (I might have the syntax slightly wrong here, I'm writing from memory.)
But it's not all bad news. TIL about Parameters<T>.
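For reference, the pattern does work; a minimal sketch (the `sendEmail` function is hypothetical), alongside the more readable alternative of just naming the type:

```typescript
// Hypothetical function whose third parameter's type we want to reuse.
function sendEmail(to: string, subject: string, opts: { cc?: string }) {
  return `${to}: ${subject}`;
}

// The positional trick the model reached for: grab parameter index 2.
// It type-checks, but silently breaks if the parameter order changes.
type OptsViaParameters = Parameters<typeof sendEmail>[2];

// The clearer alternative: name the type and import it where needed.
type SendOptions = { cc?: string };

const opts: OptsViaParameters = { cc: "boss@example.com" };
```

Both give the same type; the named version just survives refactors and reads better at the call site.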
FWIW, the article mirrors my experience when I started out too, right down to the same first month of vibecoding, then the next project, which I did exactly as he outlined.
Personally, I think it's just the natural flow when you're starting out. If he keeps going, his opinion is going to change, and as he gets to know it better, he'll likely drift more and more back towards vibecoding.
It's hard to say why, but you get better at it, even if it's genuinely hard to put into words how.
It's a little like asking a cokehead how the addiction is going for him while he is high. Obviously he's going to say it's great because the consequences haven't hit him. Some percentage of addicts will never realize it was a problem at all.
It's not random that AI happens to be built by the very same people who turned internet forums into the most addictive communication technology ever.
> he'll likely go more and more towards vibecoding again
I think "more and more" is doing some very heavy lifting here. On the surface it reads like "a lot" to many people, I think, which is why this is hard to read without cringing a bit. Read like that it comes off as "It's very addictive and eventually you get lulled into accepting nonsense again, except I haven't realized that's what's happening".
But the truth is that this comment really relies entirely on what "more and more" means here.
You can’t put it into words? Why? Perhaps you haven’t looked at it objectively?
It may actually be true. Your feeling might be right - but I strongly caution you against trusting that feeling until you can explain it. Something you can’t explain is something you don’t understand.
have you ever learned a skill?
Like carving, singing, playing guitar, playing a video game, anything?
It's easy to get better at something without understanding why you're better at it. As a matter of fact, very few people master a discipline deeply enough to grasp the reason they're actually better.
Most people just come up with random explanations which may or may not be related, which is exactly what I abstained from doing.
I've learned a number of skills, and for me none of them worked in the way you're describing. I didn't learn to cut good miter joints by randomly vibe-sawing wood until I unlocked miter joints in the skill tree. I carefully studied the errors I made, and adjusted in ways I thought might correct them, some of which helped some of which did not. Then eventually I understood the relationship between my actions and the underlying principles in enough detail to consistently hit 45 degrees.
Isn't that example pretty reductive, in that you have a directly-measurable output? I mean, the joint is either 45° (well, 90°) or it's not. Zoom out a bit, and the skill-set becomes much less definable: are my cabinets good - for some intersection of well-proportioned, elegantly-finished, and fit for purpose, with well-chosen wood and appropriate hardware.
Mind you, I don't think the process of improvement in those dimensions is fundamentally different, just much less direct and not easily (or perhaps even at all) articulable.
You can get better at something without understanding why, but you should be able to think about it and determine why fairly easily.
This is something everyone who cares about improving in a skill does regularly - examine their improvement, the reasons behind it, and how to add to them. That’s the basis of self-driven learning.
This is an absurd statement. There are many complex undertakings in sport where even the very best get better with practice and can't tell you why. In fact, the ones who think they can tell you why are the ones to be most skeptical of.
You are just making stuff up or regurgitating material from a pop science book.
Instead of accusing others of making things up, perhaps step back and re-evaluate the conversation you're taking part in. In this instance, it appears that you misunderstood or skipped over the word "learning".
Not really. I can obviously say something, like: you learn which features the models are able to actually implement, and you learn how to phrase and approach trickier features to get the model to do what you want.
And that's not really explainable without exploring specific examples. And now we're in thousands of words of explanation territory, hence my decision to say it's hard to put it into words.
I think you’re handwaving away vague, ungrounded intuition and calling it learning.
For instance, if I say “I noticed I run better in my blue shoes than my red shoes” I did not learn anything. If I examine my shoes and notice that my blue shoes have a cushioned sole, while my red shoes are flat, I can combine that with thinking about how I run and learn that cushioned soles cause less fatigue to the muscles in my feet and ankles.
The reason the difference matters is that if I don't do the learning step, when I buy another pair of blue shoes and they're flat soled, I'm back to square one.
Back to the real scenario, if you hold on to your ungrounded intuition re what tricks and phrasing work without understanding why, you may find those don’t work at all on a new model version or when forced to change to a different product due to price, insolvency, etc.
You're always free to stop at the level of abstraction at which you find a certain answer to be satisfying, but you can also keep digging. Why are cushioned shoes better? Well, it's to do with my gait. Ok, but why is my gait like that? Something-something musculoskeletal. Why is my body that way? Something-something genetic. OK, but why is that? And so on.
Pursued far enough, any line of thought will reach something non-deterministic - or, simply, That's The Way It Is - however unsatisfying that is to those of us who crave straightforward answers. Like it or not, our ground truth as human beings ultimately rests on intuition. (Feel free to say, "No, it's physics", or "No, it's maths", but I'll ask you if you're doing those calculations in your head as you run!)
It is very silly to treat zero grounding the same as accepting core, proven concepts. Your PoV here is no different than saying "It rains because god is sad and crying" is an appropriate thing to believe.
If you want to say "god is responsible for creating the precipitation cycle", sure. But we don't disregard understanding that exists to substitute intuition.
We're talking past each other, and mixing up some concepts, most of which is my fault for not writing particularly clearly.
Yeah, "God did it" is the first of those answer layers at which some people stop interrogating the world around them, just like "that's just the way I am" is where some people stop developing their self-understanding. Neither of those answers advance civilization / ourselves any further than the status quo. They're terrible answers! Everyone should be digging deeper.
However, I would not use the word "understanding" in opposition to "intuition". Someone who can generate a ballistics chart understands trajectories, but so does someone who can reliably put a basketball through a hoop or a bullet on target. I would set "analysis" against "intuition" (or "instinct", if you prefer), but they're not in opposition: instead, they reinforce each other. We're all familiar with the scientists and mathematicians who ride a hunch to a ground-breaking discovery, which is then validated by exhaustive analysis. From the other direction, athletes and musicians analyze their technique in minute detail, and practice incessantly, in order to ingrain analytical insights into instinct. (Or, if you prefer a less physical example, programmers study algorithms so that they can intuit which to apply to a particular problem.)
My point - badly expressed in my earlier comment - is that as humans we exist moment-by-moment, and as such react, in each moment, by intuition. As important as analysis is, we cannot live in analytical mode: it lags too much! Furthermore, approximately none of us will ever make a groundbreaking discovery in any field, far less in all of the areas to which we can (and should!) direct our analytical energy. At some point we have to stop (even if we are a groundbreaking genius in one area, we'll have to in all of the others), and accept the answer that satisfies our purpose or exhausts our motivation.
The TL;DR does not seem to match the rest of the article.
They claim the agents reliably generated a week’s worth of dev work for $20 in tokens, then go on to list all the failure modes and debugging they had to do to get it to work, and conclude with “Agents are not ready to autonomously ship every integration end-to-end.”
Generally a good write up that matches my experience (experts can make systems that can guide agents to do useful work, with review), but the first section is pretty misleading.
Having spent a couple years rehabbing a 100 year old house, I’m convinced the trades will be the last thing to go. When the building you’re working on has been ship-of-Theseus’d by 3 generations of home owners, everything is out of distribution.
When a robot can reliably do this work, I think it can reliably do any human job that requires physical ability and judgement.
But the problem won't be the robots. It'll be the flood of new workers who will offer to rehab the place cheaper than you. And it'll be that the white-collar owners of the house won't have enough money to blow on a rehab because their desk jobs are getting replaced by AI.
Especially if you get into a specialized trade for people with money.
I’ve repaired a lot of my historic windows myself because of how expensive it is to get someone else to do it. (Quoted 8k for one leaded glass window) I think it’s become my new backup job if I really am replaced by a computer.
We really need automated roofing. Installing shingles is easy, except that it has to be done on top of buildings. There's an experimental roofing robot, but it's not good enough for production yet.[1]
Metal roofs seem nice and easier to install too, but at least where I had a house built (Ireland) the local planners (aka meddling old people with too much time) thought it wasn’t suitable for a “home” so you had to spend four times as much on a slate roof.
Eh, for a long time it's been cheaper and better to just demolish and rebuild rather than deal with never-ending issues in major fixer-uppers. Robots will probably be able to do uncomplicated cookie-cutter builds in a decade or two; there's just too much money in the construction sector for AI companies looking for the next big thing to disrupt to ignore it.
LLMs rarely if ever proactively identify cleanup refactors that reduce the complexity of a codebase. They do, however, still happily duplicate logic or large blocks of markup, defer imports rather than fixing dependency cycles, introduce new abstractions for minimal logic, and freely accumulate a plethora of little papercuts and speed bumps.
These same LLMs will then get lost in the intricacies of the maze they created on subsequent tasks, until they are unable to make forward progress without introducing regressions.
You can at this point ask the LLM to rewrite the rat’s nest, and it will likely produce new code that is slightly less horrible but introduces its own crop of new bugs.
All of this is avoidable, if you take the wheel and steer the thing a little. But all the evidence I’ve seen is that it’s not ready for full automation, unless your user base has a high tolerance for bugs.
I understand Anthropic builds Claude Code without looking at the code. And I encounter new bugs, some of them quite obvious and bad, every single day. A Claude process starts at 200MB of RAM and grows from there, for a CLI tool that is just a bundle of file tools glued to a wrapper around an API!
I think they have a rat's nest over there, but they're the only game in town, so I have to live with this nonsense.
> - This is partly b/c it is good at things I'm not good at (e.g. front end design)
Everyone thinks LLMs are good at the things they are bad at. In many cases they are still just giving “plausible” code that you don’t have the experience to accurately judge.
I have a lot of frontend app dev experience. Even modern tools (Claude w/Opus 4.6 and a decent Claude.md) will slip in unmaintainable slop in frontend changes. I catch cases multiple times a day in code review.
Not contradicting your broader point. Indeed, I think if you’ve spent years working on any topic, you quickly realize Claude needs human guidance for production quality code in that domain.
Yes I’ve seen this at work where people are promoting the usage of LLMs for.. stuff other people do.
There’s also a big disconnect in terms of SDLC/workflow in some places.
If we take at face value that writing code is now 10x faster, what about the other parts of the SDLC? Is your testing/PR process ready for 10x the velocity or is it going to fall apart?
What % of your SDLC was actually writing code? Maybe time to market is now ~18% faster because coding was previously 20% of the duration.
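The ~18% figure checks out; a quick sketch of the arithmetic (the 20% coding share and 10x speedup are the hypothetical numbers from the comment above):

```typescript
// Amdahl-style estimate: overall time saved when one phase of the
// SDLC gets faster but the rest of the cycle stays the same.
function overallTimeSaved(codingShare: number, speedup: number): number {
  const newTotal = (1 - codingShare) + codingShare / speedup;
  return 1 - newTotal; // fraction of total cycle time saved
}

// Coding is 20% of the cycle and becomes 10x faster:
// 0.8 + 0.02 = 0.82 of the original time, i.e. ~18% saved overall.
```

The same function also shows why the claim cuts both ways: push the coding share up to 50% and the overall saving jumps to 45%.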
I’ve been slow to invest in building flows around parallelizing agent work under the assumption that eventually inference will get fast enough that I will basically always be the bottleneck.
Excited to see glimpses of that future. Context switching sucks and I’d much rather work focused on one task while wielding my coding power tools.
It sounds like in this case there was some troll-fueled comeuppance.
> “We’re not a scam,” he continued. “We’re a married couple trying to do the right thing by people … We are legit, we are real people, we employ sales staff.”
> Australian Tours and Cruises told CNN Tuesday that “the online hate and damage to our business reputation has been absolutely soul-destroying.”
This might just be BS, but at face-value, this is a mom and pop shop that screwed up playing the SEO game and are getting raked over the internet coals.
Your broader point about blame-washing stands though.
That's the thing about scammers, they operate in plausibly deniable ways, like covering up malice with incompetence. They make taking things at face value increasingly costly for the aggrieved.
I use it in a Python/TS codebase (series D B2B SaaS with some AI agent features). It can usually “make it work” in one shot, but the code often requires cleanup.
I start every new feature w/Claude Code in plan mode. I give it the first step, point it to relevant source files, and tell it to generate a plan. I go catch up on my Slack messages.
I check back in and iterate on the plan until I’m happy, then tell it to implement.
I go to a team meeting.
I come back and review all the code. Anything I don’t 100% understand I ask Gemini to explain. I cross-check with primary sources if it’s important.
I tweak the generated code by hand (faster than talking with the agent), then switch back to plan mode and ask for specific tests. I almost always need to clean up the tests for doing way too much manual setup, despite a lot of Claude.md instructions to the contrary.
In the end, I probably get the work done in 30% less wall-clock time of Claude implementing (counting plan time), but I’m also doing other things while the agent crunches. Maybe 50% speed boost in total productivity? I also learn something new on about a third of features, which is way more than I did before.