I haven't time to do it but can someone try to unminify the newer version based on the minified new version + the source of previous version? There's gotta be a way to do this
> They have terms to not allow `claude -p` used like that.
Like what? I legitimately don't understand what is prohibited. Using claude as part of a shell script? Am I only allowed to use claude if I physically type the commands into a terminal via my keyboard? Why even ship `claude -p` at all?
downloading the official ones for my M3 Max 128GB via LM Studio, I can't seem to get them to load. they fail for some unknown reason; have to dig into the logs. any luck for you?
The Unsloth llama.cpp guide[1] recommends building the latest llama.cpp from source, so it's possible we need to wait for LM Studio to ship an update to its bundled llama.cpp. Fairly common with new models.
This is the single worst function in the codebase by every metric:
- 3,167 lines long (the file itself is 5,594 lines)
- 12 levels of nesting at its deepest
- ~486 branch points of cyclomatic complexity
- 12 parameters + an options object with 16 sub-properties
- Defines 21 inner functions and closures
- Handles: agent run loop, SIGINT, rate limits, AWS auth, MCP lifecycle, plugin install/refresh, worktree bridging, team-lead polling (a while(true) inside), control message dispatch (dozens of types), model switching, turn interruption recovery, and more
Looks like it tries wl-copy, then xclip, then xsel. I have no idea what those are, but Google says it's for Wayland, so I think it's a Linux function trying to copy to the clipboard? I think their problem is with the use of '.then(...=>...)', since there doesn't seem to be a way to tell each function that the nested ones actually finished.
wl-copy is a program to put text into the system clipboard if you're on a wayland-based system (so you can ctrl-v paste it somewhere else). Imagine like, cat ~/.ssh/whatever | wl-copy and then pasting into github or something.
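A minimal sketch (Node.js, with hypothetical helper names, not the actual Claude Code implementation) of that try-in-order fallback pattern: attempt each clipboard tool, and on failure fall through to the next one. Real code would spawn wl-copy, xclip, or xsel via child_process and pipe the text to stdin; here the tools are mocked to keep it self-contained.

```javascript
// Try each async "copy" function in order until one succeeds.
async function copyWithFallback(text, tools) {
  for (const tool of tools) {
    try {
      return await tool(text); // first tool that succeeds wins
    } catch (err) {
      // tool missing or failed; fall through to the next one
    }
  }
  throw new Error("no clipboard tool available");
}

// Mock tools standing in for wl-copy / xclip / xsel:
const wlCopy = async () => { throw new Error("not on Wayland"); };
const xclip  = async (t) => `xclip got: ${t}`;
const xsel   = async (t) => `xsel got: ${t}`;

copyWithFallback("hello", [wlCopy, xclip, xsel]).then(console.log);
// → xclip got: hello
```

With async/await the "did the nested call actually finish?" question answers itself: each attempt is awaited before the loop moves on, which is harder to guarantee with hand-nested `.then(...)` chains.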
I'm sure this is no surprise to anyone who has used CC for a while. This is the source of so many bugs. I would say "open bugs" but Anthropic auto-closes bugs that don't have movement on them in like 60 days.
> This issue has been automatically locked since it was closed and has not had any activity for 7 days. If you're experiencing a similar issue, please file a new issue and reference this one if it's relevant.
> This should be at minimum 8–10 separate modules.
Can't really say that for sure. The way humans structure code isn't some ideal best possible state of computer code, it's the ideal organization of computer code for human coders.
Nesting and cyclomatic complexity are indicators ("code smells"). They aren't guaranteed to lead to worse outcomes. If you have a function with 12 levels of nesting, but in each nest the first line is 'return true', you actually have 1 branch. If 2 of your 486 branch points are hit 99.999% of the time, the code is pretty dang efficient. You can't tell for sure if a design is actually good or bad until you run it a lot.
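A contrived sketch of that point (hypothetical function, not from the codebase under discussion): the nesting metric counts several levels and several branch points, but if the first condition covers nearly all inputs, almost every call takes a single path.

```javascript
// Deeply nested by the metric, but if x > 0 covers ~99.999% of inputs,
// the function effectively has one hot branch with an early return.
function classify(x) {
  if (x > 0) {
    return "positive"; // the overwhelmingly common early return
  } else {
    if (x === 0) {
      return "zero";
    } else {
      if (Number.isInteger(x)) {
        return "negative integer";
      } else {
        return "negative fraction";
      }
    }
  }
}

console.log(classify(5));    // → positive
console.log(classify(-2.5)); // → negative fraction
```

Static metrics count every branch equally; a profile of real traffic is what tells you whether the deep paths actually matter.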
One thing we know for sure is LLMs write code differently than we do. They'll catch incredibly hard bugs while making beginner mistakes. I think we need a whole new way of analyzing their code. Our human programming rules are qualitative because it's too hard to prove if an average program does what we want. I think we need a new way to judge LLM code.
The worst outcome I can imagine would be forcing them to code exactly like we do. It just reinforces our own biases, and puts in the same bugs that we do. Vibe coding is a new paradigm, done by a new kind of intelligence. As we learn how to use it effectively, we should let the process of what works develop naturally. Evolution rather than intelligent design.
I don't buy this. Claude doesn't usually have any issues understanding my code. It has tons of issues understanding its code.
The difference between my code and Claude's code is that when my code is getting too complex to fit in my head, I stop and refactor it, since for me understanding the code is a prerequisite for writing code.
Claude, on the other hand, will simply keep generating code well past the point when it has lost comprehension. I have to stop, revert, and tell it to do it again with a new prompt.
If anything, Claude has a greater need for structure than me since the entire task has to fit in the relatively small context window.
> One thing we know for sure is LLMs write code differently than we do.
Kind of. One thing we do know for certain is that LLMs degrade in performance with context length. You will undoubtedly get worse results if the LLM has to reason through long functions and high LOC files. You might get to a working state eventually, but only after burning many more tokens than if given the right amount of context.
> The worst outcome I can imagine would be forcing them to code exactly like we do.
You're treating "code smells" like cyclomatic complexity as something that is stylistic preference, but these best practices are backed by research. They became popular because teams across the industry analyzed code responsible for bugs/SEVs, and all found high correlation between these metrics and shipping defects.
Yes, coding standards should evolve, but... that's not saying anything new. We've been iterating on them for decades now.
I think the worst outcome would be throwing out our collective wisdom because the AI labs tell us to. It might be good to question who stands to benefit when LLMs aren't leveraged efficiently.
> They became popular because teams across the industry analyzed code responsible for bugs/SEVs, and all found high correlation between these metrics and shipping defects.
Yes, based on research of human code. LLMs write code differently. We should question whether the human research applies to LLMs at all. (You wouldn't take your assumptions about chimp research and apply them to parrots without confirming first)
> I think the worst outcome would be throwing out our collective wisdom because the AI labs tell us to.
We don't have to throw it out. But our current use of LLMs is a dramatic change from what came before. We should be questioning assumptions and traditions that come from a different way of working and a different kind of intelligence. Humans have a habit of trying to force things to be how they think they should be, rather than allowing them to grow organically, when the latter is often better for a system we don't yet understand.
They write code differently but that doesn't mean that's the kind of code they prefer to read. Don't ascribe too much intention to a stochastic process.
Their coding style is above all else a symptom of their very limited context window and complete amnesia for anything that's not in the window.
I don't think there's intention. And yes, its output is defined by its limits. But it's not just the context, is it? Their coding style is, above all else, a result of an algorithm and input. The training data, the reinforcement, the model design, the tuning, the prompt, the context. Change any one of those things and the code changes. They are a system, like an ecosystem. Let water flow and it finds its own path. But try to dam it and it creates unintended consequences. I think what we're going to find is some of our rules apply more to a human world than an LLM world.
I’ve heard this take before, but if you’ve spent any time with LLMs I don’t understand how your take can be: “I should just let this thing that makes mistakes all the time, and seems oblivious to the complexity it’s creating because it only observes small snippets out of context, make its own decisions about architecture; this is just how it does things and I shouldn’t question it.”
I think this view assumes no human will or should ever read the code. This is considered bad practice because someone else will not understand the code as well, whether it was written by a human or an agent. Unless 0% human oversight is needed, agents should still code like us.
Weird and inscrutable can be good: think genetic algorithms [1] such as antenna optimization for EM radiation [2]. But I like my source code on the intelligible side.
the claude code team ethos, as far as i’ve been led to understand— which i agree with, mind you— is that there is no point in code-reviewing ai-generated code… simply update your spec(s) and regenerate. it is just a completely different way of interacting with the world. but it clearly works for them, so people throwing up their hands should at least take notice of the fact that they are absolutely not competing with traditional code along traditional lines. it may be sucky aesthetically, but they have proven from their velocity that it can be extremely effective. welcome to the New World Order, my friend.
There's a reputational filtering that happens when using dependencies. Stars, downloads, last release, who the developer is, etc.
Yeah we get supply chain attacks (like the axios thing today) with dependencies, but on the whole I think this is much safer than YOLO git-push-force-origin-main-ing some vibe-coded trash that nobody has ever run before.
I also think this isn't really true for the FAANGs, who ostensibly vendor and heavily review many of their dependencies because of the potential impacts they face from them being wrong. For us small potatoes I think "reviewing the code in your repository" is a common sense quality check.
I'd trust that dude over professional leetcoders any day.
But you're right that trust is a complicated thing and often misplaced. I think as an industry we're always reevaluating our relationship with OSS, and I'm sure LLMs will affect this relationship in some way. It's too early to tell.
I find this relationship fascinating, since the vast majority of OSS developers will not hesitate to pull in library X or framework Y knowing really nothing about it: who the developers are, what the quality is, what their release process and QA look like, etc. The first thing I do now, as a "senior" of decades, when I get approached with "we should consider using ____" is to send them to the project's issues page (e.g. https://github.com/oven-sh/bun/issues) and say "spend 60-90 minutes minimum here reviewing the issues, then come back and tell me whether or not the inclusion of this is something we should consider." And yet, now with LLMs, there are sooooooooo many comments on HN like "oh they must be supervised, who knows what they will be doing etc." Gotta supervise them, but some mate in Boise is all good; hopefully someone else will review his stuff that is going into your next release...
Is the CEO responsible for a company's financial performance? Do they review every line of code the company writes?
It is more irresponsible to spend the time reviewing all of the code rather than spending that time on things with bigger levers for satisfying your customers.
yes but if a dev pushes a line of code that wipes the accounts of millions of users at a fintech, the dev will get fired but the CEO will get sued into oblivion.
if the agent isn't responsible, you HAVE to be, because angry people won't listen to "it's no one's fault your money is gone"
Is this a serious question? If you are handling sensitive information how do you confirm your application is secure and won't leak or expose information to people who shouldn't know it?
Exactly: unit tests, integration tests, UI tests. This is how code should be verified no matter the author. Just today I told my team we should not be reading every line of LLM code. Understand the pattern. Read the interesting/complex parts. Read the tests.
But unit and integration tests generally only catch the things you can think of. That leaves a lot of unexplored space in which things can go wrong.
Separately, but related - if you offload writing of the tests and writing of the code, how does anybody know what they have other than green tests and coverage numbers?
I have been seeing this problem building over the last year. LLM generated logic being tested by massive LLM generated tests.
Everyone just goes overboard with the tests since you can easily just tell the LLM to expand on the suite. So you end up with a massive test suite that looks very thorough and is less likely to be scrutinized.
if you are asking me how you *guarantee* there is not a single possible exploit in your code, you can't do that. But you can do your best and learn about common pitfalls and be reasonably competent. Just because you can't do the former doesn't mean the latter is useless.
While the technology is young, bugs are to be expected, but I'm curious what happens when their competitors mature their products, clean up the bugs, and stabilize, while Claude is still kept in this trap where a certain number of bugs and issues are just a constant fixture due to vibe coding. But hey, maybe they really do achieve AGI and get over the limitations of vibe coding without human involvement.
Because in reality no one except for good engineers actually care about what the code looks like. The only thing most users care about with Claude Code is having it quickly vibe code the crappy idea they came up with that is going to 10x their lives, or whatever.
I agree the functions in a file should probably be reasonably-sized.
It's also interesting to note that due to the way round-tripping tool-calls work, splitting code up into multiple files is counter-productive. You're better off with a single large file.
I'm not sure that humans are great at this either. Think about how we use frameworks and have complex supply chains... we sort of get "good enough" at what we need to do and pray a lot that everything else keeps working and that our tooling (things like Artifactory) saves us from supply chain attacks. Or we just run piles of old, outdated code because "it works". I can't tell you how many microservices I have seen that are "just fine" but no one in the current org has ever read a line of what's in them, and the people who wrote them left ages ago.
> clarity too
Yes, but define clarity!
I recently had the pleasure of fixing a chunk of code that was part of a data pipeline. It was an If/elseif/elseif structure... where the final two states were fairly benign and would have been applicable in 99 percent of cases. Everything else was to deal with the edge cases!
I had an idea of where the issue was, but I didn't understand how the code ended up in the state it was in... Blame -> find the commit message (references a ticket) -> find the Jira ticket (references Salesforce) -> find the original customer issue in Salesforce, read through the whole exchange there.
A two line comment could have spared me all that work, to get to what amounted to a dead simple fix. The code was absolutely clear, but without the "why" portion of the context I likely would have created some sort of regression, that would have passed the good enough testing that was there.
I re-wrote a portion of the code (expanding variable names) - that code is now less "scannable" and more "readable" (different types of clarity). Dropped in comments: a few sentences of explaining, and references to the tickets. Went and updated tests, with similar notes.
Meanwhile, elsewhere (other code base, other company), that same chain is broken... the "bug tracking system" that is referenced in the commit messages there no longer exists.
I have a friend who, every time he updates his dev env, calls me to report that he "had to go update the wiki again!" because someone made a change and told everyone in a Slack message. Here is yet another vast repository of degrading, unsearchable, and unusable tribal knowledge embedded in so many organizations out there.
Don't even get me started on the project descriptions/goals/tasks that amount to pantomime on post-it notes, absent any sort of genuine description.
Lack of clarity is very much also a lack of "context" in situ problem.
I think humans are pretty good at it with small teams and the right structure. There are definitely dysfunctional orgs as you describe where humans produce garbage code yes. I blame the org for that, not the humans.
As to what defines clarity, yes of course, like the word quality this is very hard to define, but we can certainly recognise when it was not considered.
I think it is a goal worth striving for though, and abandoning code standards because we now have AI helpers is stupid and self-defeating, even if we think they are very capable and will improve.
The end of history has not in fact arrived with generative AI, we still have to maintain software after.
Unit testing is much much harder when you have functions spanning thousands of lines and no abstractions. You have to white box test everything to ensure that you hit all code paths, and it is much more expensive to maintain such tests, both as a human and LLM. I don't think this can be ignored just because LLMs are writing the code.
Yeah, I honestly don't understand his comment. Is it bad code writing? Pre-2026? Sure. In 2026? Nope. Is it going to be a headache for some poor person on call? Yes. But then again, are you "supposed" to go through every single line in 2026? Again, no. I hate it. But the world is changing, and till the bubble pops this is the new norm.
My first word was literally "Yes", so I agree that a function like this is a maintenance nightmare for a human.
And, sure, the code might not be "optimized" for the LLM, or token efficiency.
However, to try and make my point clearer: it's been reported that Anthropic has "some developers who don't write code" [1].
I have no inside knowledge, but it's possible, by extension, to assume that some parts of their own codebase are "maintained" mostly by LLMs themselves.
If you push this extension, then, the code that is generated only has to be "readable" to:
* the next LLM that'll have to touch it
* the compiler / interpreter that is going to compile / run it.
In a sense (and I know this is a stretch, and I don't want to overdo the analogy), are we judging program quality here by reading something more akin to "the x86 asm output by the compiler", rather than the "source code" - which in this case is "english prompts", hidden somewhere in the claude code session of a developer?
Just speculating, obviously. My org is still very much more cautious, mandating the same standard for code generated by LLMs as for code written by humans; and I agree with that.
I would _not_ want to debug the function described by the commenter.
So I'm still very much on the "claude as a very fast text editor" side, but is it unreasonable to assume that anthropic might be further on the "claude as a compiler for english" side?
The dataset miscomparison is a big problem. The prompt is super specific to ARC-AGI-3, which is perfectly fine to do, but skimming it I saw nothing that appears specific to the 25 games in the dataset. Especially considering they've only had one day for overfitting. Could be quite subtle leakage though.
If it was not spinning up so many Python processes and overwhelming the system with them (friends found out it was consuming too much CPU from the fan noise!), it would have been much more successful. So, similar to the xz attack.
it does a lot of CPU-intensive work:

    spawn background python
    decode embedded stage
    run inner collector
    if data collected:
        write attacker public key
        generate random AES key
        encrypt stolen data with AES
        encrypt AES key with attacker RSA pubkey
        tar both encrypted files
        POST archive to remote host
playwright can do all of that too. I'm confused why this is necessary.
If coding agents are given Playwright access, they can actually do it better: using the Chrome DevTools Protocol they can interact with the browser and experiment with things without having to wait for all of this to complete before making moves. For instance, I've seen Claude Code capture console messages from a running Chrome instance and use them to debug things...