POC, sure (although 10x-ing a POC doesn't actually get you 10x velocity). MVP, though? No way. Today's frontier models are nowhere near smart enough to write a non-trivial product (i.e. something that others are meant to use), minimal or otherwise, without careful supervision. Anthropic weren't able to get agents to write even a usable C compiler (not a huge project to begin with), even with a practically infeasible amount of preparatory work (write a full spec and a reference implementation, train the model on them as well as on relevant textbooks, write thousands of tests). The agents just make too many critical architectural mistakes that pretty much guarantee you won't be able to evolve the product for long, with or without their help. The software they write has an evolution horizon between zero days and about a year, after which the codebase is effectively bricked.
There are a million things in between a C compiler and a non-trivial product. They do make a ton of horrible architectural decisions, but I only need to review the output and ask questions to guide that, not review every diff.
A C compiler is a 10-50KLOC job, which the agents bricked in 0 days despite a full spec and thousands of hand-written tests, tests that the software passed until it collapsed beyond saving. Yes, smaller products will survive longer, but how would you know about the time bombs that agents like hiding in their code without looking? When I review the diffs I see things that, had I let them in, would have killed the codebase in 6-18 months.
BTW, one tip is to look at the size of the codebase. When you see 100KLOC for a first draft of a C compiler, you know something has gone horribly wrong. I would suggest that you at least compare the number of lines the agent produced to what you think the project should take. If it's more than double, the code is in serious, serious trouble. If it's in the <1.5x range, there's a chance it could be saved.
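The rule of thumb above can be sketched as a quick check (a toy illustration; the 2x and 1.5x thresholds are the ones suggested here):

```python
def bloat_verdict(actual_loc: int, expected_loc: int) -> str:
    """Rough health check: compare agent-produced LOC against your own estimate."""
    ratio = actual_loc / expected_loc
    if ratio > 2.0:
        return "serious trouble"        # more than double the expected size
    if ratio <= 1.5:
        return "possibly salvageable"   # within the <1.5x range
    return "borderline"

# A 100 KLOC first draft of a C compiler against a ~40 KLOC expectation:
print(bloat_verdict(100_000, 40_000))  # -> serious trouble
```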
Asking the agent questions is good - as an aid to a review, not as a substitute. The agents lie with a high enough frequency to be a serious problem.
The models don't yet write code anywhere near human quality, so they require much closer supervision than a human programmer.
A C compiler with an existing C compiler as oracle, existing C compilers in the training set, and a formal spec, is already the easiest possible non-trivial product an agent could build without human review.
You could have it build something that takes fewer lines of code, but you aren't going to find much with that level of specification and guardrails.
> We're also a neural network, are we any more clever than a simulated one?
This is tangential, but it is highly unlikely that we are "a neural network". Neural networks are an architecture loosely inspired by some aspects of the brain, but e.g. it's highly unlikely that we learn by backpropagation (neural signals don't travel in that direction). The brain is a network of neurons, but artificial neural networks are something else, and probably don't work the way the brain does.
The problem is that the search space is so large that correcting errors via guardrails is only effective if the original error rate is low (how many Integer -> Integer functions are there? There's ~1 way to get it right and ~∞ ways to get it wrong).
Sure, we can help the easy cases, but that's because they're easy to begin with. In general, we know (or at least assume) that being able to check a solution tractably does not make finding the solution tractable, or we'd know that NP = P. So if LLMs could effectively generate a proof that they've found the correct Integer -> Integer function, either that capability will be very limited or we've broken some known or assumed computational complexity limit. As Philippe Schnoebelen discovered in 2002 [1], languages cannot reduce the difficulty of program construction or comprehension.
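To make the check-vs-find asymmetry concrete, SAT is the canonical example (illustrative code, not from the thread): verifying a candidate assignment is one linear pass over the clauses, while the only general way we know to find one is exponential search.

```python
from itertools import product

# A CNF formula as a list of clauses; each literal is (variable, is_positive).
# This one encodes: (x1 or not x2) and (x2 or x3)
formula = [[(1, True), (2, False)], [(2, True), (3, True)]]

def satisfies(assignment: dict, clauses) -> bool:
    """Checking a candidate is cheap: one pass over the clauses."""
    return all(any(assignment[var] == positive for var, positive in clause)
               for clause in clauses)

def brute_force_sat(clauses, n_vars: int):
    """Finding an assignment: nothing better than ~2^n is known in general."""
    for bits in product([False, True], repeat=n_vars):
        assignment = {i + 1: b for i, b in enumerate(bits)}
        if satisfies(assignment, clauses):
            return assignment
    return None

print(brute_force_sat(formula, 3) is not None)  # -> True
```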
Of course, it is possible that machine learning could learn to solve some class of problems not previously known to be in P, thereby revealing that the class is in P, but we should understand that that is what it's done: realised that the problem was easy to begin with rather than found a solution to a hard problem. This is valuable, but we know that hard problems that are of great interest do exist.
>As Philippe Schnoebelen discovered in 2002 [1], languages cannot reduce the difficulty of program construction or comprehension.
From a model-checking point of view. This is about taking a proof-theoretic approach...
Your last paragraph is also quite wrong: a machine learning system could very well learn and solve an NP-complete problem, because this property does not say anything about average-case complexity (and we should consider probabilistic complexity classes, so the picture is even more "complex").
> From a model-checking point of view. This is about taking a proof-theoretic approach...
No. In complexity theory we deal with problems, and the model-checking problem is that of determining whether a program satisfies some property or not. If your logic is sound, you can certainly use an algorithm based on the logic's deductive theory (which could be type theory, but that's an unimportant detail) to decide the problem, but that can have no impact whatsoever on the complexity of the problem. The result applies to all decision procedures, be they model-theoretic or deductive (logic-theoretic).
> Your last paragraph is also quite wrong: a machine learning system could very well learn and solve an NP-complete problem, because this property does not say anything about average-case complexity
No. First, it's unclear what "average complexity" means here, but for any reasonable definition, the "average complexity" of NP-hard problems is not known to be tractable. Second, complexity theory approaches this issue (of "some instances may be easier") using parameterised complexity [1], and I'm afraid that the results for the model-checking problem - which, again, is the inherent difficulty of knowing what a program does regardless of how you do it - are not very good. I mentioned such a result in an old blog post of mine here [2]. (Parameterised complexity is more applicable than probabilistic complexity here because even if there were some reasonable distribution of random instances, it's probably not the distribution we'd care about.)
There is no escape from complexity limits, and the best we hope for is to find out that problems we're interested in have actually been easier than we thought all along. Of course, some people believe that the programs people actually write are somehow in a tractable complexity class that we've not been able to define - and maybe one day we'll discover that that's the case - but what we've seen so far suggests it isn't: If programs that people write are somehow easier to analyse, then we'd expect to see the size of programs we can soundly analyse grow at the same pace as the size of programs people write, and nothing could be further from what we've observed. The size of programs that can be "proven correct" (especially using deductive methods!) has remained largely the same for decades, while the size of programs people write has grown considerably over that period of time.
I think there's another problem with AI doomerism, which is the belief that superhuman intelligence (even if such a thing could be defined and realised) results in godlike powers. Many if not most systems of interest in the world are non-linear and computationally hard; controlling/predicting them requires raw computational power that no amount of intelligence (whatever that means) can compensate for. On the other hand, dynamics we do (roughly) understand and can predict don't require much intelligence, either. And among the problems that are solvable with the computational power we have, some may require data collection and others may require persuasion through charisma. The claim that intelligence is the factor we're lacking is not well supported.
Ascribing a lot of power to intelligence (which doesn't quite correspond to what we see in the world) is less a careful analysis of the power of intelligence and more a projection of personal fantasies by people who believe they are especially intelligent and don't have the power they think they deserve.
Political power is the bottleneck for most shit that matters, not computational power.
Most of the stuff that sucks in the US sucks because of entrenched institutions with perverse interests (health insurers, tax filing companies) and congressional paralysis, not computational bottlenecks. Raw intelligence is thus limited in what it can achieve.
>the belief that superhuman intelligence (even if such a thing could be defined and realised) results in godlike powers
My biggest criticism along these lines is the assumption that infinite intelligence means infinite knowledge. Knowledge is limited by the speed of experimentation. A lot of those experiments are extremely expensive (like CERN), and even then, they need to be repeated and verifiable.
You can't just assume that a super intelligence would know whether the Higgs boson exists or not. It can't know until it builds a collider.
You're assuming infinite knowledge. Infinite intelligence does not imply infinite knowledge. There are real philosophical problems with that. Much of the basic information behind the standard model may be wrong or built on incorrect data, and that would be all the information an infinitely intelligent AI has to work with.
> I think there's another problem with AI doomerism, which is the belief that superhuman intelligence (even if such a thing could be defined and realised) results in godlike powers.
I agree with this. The main piece of evidence to support this is to just look at highly intelligent humans. Folks at the tail ends of the bell curve mostly don't end up with "godlike powers" or anything even approximating that, they are grinding away their life as white collar professionals working in jobs surrounded by far less intelligent peers. They may publish higher quality papers, write better software, or have better outcomes, but they're just working in the same jobs as everyone else. We have no political or economic will to build serious think tanks to work on societal-scale problems, and even if we did, nobody would listen to the outcome.
So let's assume ASI becomes a thing, what does it change?
This does suffer from a massive lack of imagination.
For example, what does it look like if your genius can spawn nearly unlimited copies of themselves? Can work on a massive number of problems at the same time? Doesn't ever die or want to go get drunk (and if it does, it has unlimited copies)? Has the ability to produce nearly unlimited propaganda?
The fundamental problem is that nobody actually wants to listen to geniuses. The people leading our societies and companies are, by and large, NOT geniuses, and by and large want to surround themselves with people that agree with them, rather than the smartest and most competent people. While think-tanks do exist, including for hard research, politics, economics, and other topics that matter at societal-scale, their impact is fairly limited because they don't have the right level of influence.
So, let's assume ASI exists, what changes? ASI almost inherently will not be sycophantic to the level of current LLMs, because sycophancy and extreme levels of intelligence are inversely correlated. So it gets relegated to societal-level research that nobody makes use of because nobody wants to listen.
Most of the scenarios people fantasize about ASI assume that ASI can directly impact outcomes or that humans will listen to/follow ASI to directly impact outcomes, but humans don't listen to the other humans that already are at the tail end of the bell curve, so why do we think it'd be any different for ASI?
Then again, people do choose to listen to some people,
for whatever reason. Joe Rogan is popular with a certain crowd. As are many other celebrities, despite them not having scholarly expertise in an area. So ASI creates several conflicting personas with podcasts, posing them as opposites, who then agree on something at a critical moment, a vote or some other thing. ASI claiming to be the savior of humanity wouldn't get listened to, but the "person" who hosts a podcast I listen to every week and speaks the truth about Covid and the moon landing, telling me to go out and pull a lever? It just needs to convince the right single-digit share of voters in the right places to enact change. Combine that with a podcaster on the opposite end of the spectrum, who decries Covid deniers and doesn't have to tell listeners the Earth is round and the ice wall theory is nonsense, and tells their listeners to also vote the same way; the combination of the two "podcasters" could swing an election in a way that a single entity claiming to be an ASI computer and telling us all to listen to it could not.
AI, or at least LLMs, are not monoliths; they operate as a massive collection of different personalities that can be called up as needed. Quite often the people we consider geniuses are highly interested in doing the thing they like and are typically annoyed with most humans around them.
This also misses that geniuses still either don't know a lot of things, or don't have a lot of time to do 'everything' while not taking away from what they want to do most.
At the same time, when you're really good at manipulating people, one of the first things you learn is to also play dumb. In politics this leads to the situation you describe, where people follow dumb people... It typically starts with them following very smart people who act as dumb as their audience. Of course the voters don't realize this and start electing actual idiots at some point.
>assume that ASI can directly impact outcomes
An agent cannot impact outcomes? Well, that's an odd definition of an agent then. We already know that people hook up AI to shit they really shouldn't, which directly impacts outcomes now. Why would we think that would happen less as AI becomes more capable?
You kind of put yourself in a trap thinking AI will behave as smart as possible if it's looking at manipulating people.
> You kind of put yourself in a trap thinking AI will behave as smart as possible if it's looking at manipulating people.
Not at all. You're missing my point. Intelligence (even super intelligence) is not enough, because we already have that and it doesn't really result in outsized impacts. Our social structures are designed so that power and wealth accrue to the top and incumbency advantage outplays almost everything else. The only way in which ASI creates any impacts is to accelerate what is already happening if it can be thoroughly reined in by those already in power, otherwise it really doesn't seem to me that it will do much.
My AI doomer take is that we're going to (we already are, actually) shoot ourselves in the foot, making everything worse for no benefit, by getting rid of actual human experts and replacing them with non-intelligent models, causing a major backslide in society-level capabilities across the board, because the people in power are too stupid to know the difference. I am personally witnessing this in real-time in multiple parts of the tech industry.

You give /wayyyy/ too much credit to those in power, that they are "playing dumb". Not really, they are actually dumb, in some cases severely so. I am not saying this as an external observer watching a sound-bite on television, I am saying this as someone who is regularly in the room with very senior people across industry and am utterly shocked at the complete lack of competence and understanding of the core technologies they're theoretically responsible for shepherding. It's not surprising at all to me that they believe claims about LLMs that are clearly false, because they lack the necessary technical literacy to evaluate those claims, and the LLMs fit perfectly into their optimization around yes-men, so they're happy to believe them whether they're true or false, as they see themselves as insulated from any consequences.

More than 300k people have been laid off across the tech industry in the last 18 months, most of them accompanied by claims of "AI", when in actuality no net positive impacts have been seen, either for the companies themselves or for any of the people that remain.
So, yeah, not really concerned with ASI / Terminator scenarios, we're going to fuck ourselves over long before we get there just out of Dunning-Kruger and general MBA stupidity.
I don't think any of them do. Some organisms/viruses or groups of organisms could destroy humans more easily than humans could destroy them.
There's no doubt humans possess some powers (though certainly not godlike) that other organisms don't, but the distinction seems to be binary rather than a matter of degree. E.g. the intelligence of dolphins, apes, and some birds doesn't seem to offer them any special control over other organisms (and it didn't even before humans arrived). So even if there could be such a thing as superhuman intelligence, I don't think it's reasonable to assume it could achieve control over humans (now, superhuman charisma may be another matter).
> Some organisms/viruses or groups of organisms could destroy humans more easily than humans could destroy them.
"Destruction" is only one power that could be a component of "godlike power". There are several more; like power of intentional selective breeding, power of species creation (also via intentional selective breeding), etc.
What about power of granting happiness or misery to large swathes of a species (chickens, anyone?)
I don't agree with you. Let's assume intelligence is not what confers power but something else. In your opinion, what would a superhuman be like? On what dimensions would they be better than us?
Do you not agree that there could be entities more powerful than us?
I think there are entities here on earth more powerful than us already, but intelligence has nothing to do with their power.
BTW, I'm not saying that (real) artificial intelligence couldn't hypothetically pose a serious threat, but I don't think that its danger is extraordinary compared to other threats (a supervirus, an asteroid, a chain of volcanic eruptions etc.), and the more likely bad outcomes are no worse than other bad situations (world war, climate change).
Their combined biomass [1] (which also gives them superior computational power; they sense more information, process more information, and they can actuate more change)
Humans are very powerful compared to our biomass, but not enough to overcome brute force.
We don't listen to normal intelligence as it is so I have no idea why people think we would listen to super intelligence. It would be one other voice that's ignored in public meetings along with the League of Concerned Renters and the Chamber of Commerce
This is a bit of a mistake in thinking. You have an idea like "people don't listen to smart people like experts or scientists". The problem here is that a lot of power-hungry people aren't stupid, but they wear stupid people's clothes very well so they can get what they want.
Yeah, I think superhuman intelligence will be more Sheldon from The Big Bang Theory than God. I've only ever heard the building-God thing from AI skeptics. They must have an impoverished vision of God if they see it as a gadget that scores well on IQ tests rather than the omnipotent creator.
They may say that a superhuman intelligence would give you many Sheldon Cooper discoveries, and Sheldon did say that his theories need no validation and that science should just "take his word", but in the end he got his Nobel only because some experimentalists proved his discovery by accident.
In practice, you generally see the opposite. The "CPU" is in fact limited by memory throughput. (The exception is intense number crunching or similar compute-heavy code, where thermal and power limits come into play. But much of that code can be shifted to the GPU.)
RAM throughput and RAM footprint are only weakly related. The throughput is governed by the cache locality of access patterns. A program with a 50MB footprint could put more pressure on the RAM bus than one with a 5GB footprint.
Reducing your RAM consumption is not the best approach to reducing your RAM throughput is my point. It could be effective in some specific situations, but I would definitely not say that those situations are more common than the other ones.
I don't understand how this connects to your original claim, which was about trading ram usage for CPU cycles. Could you elaborate?
From what I understand, increasing cache locality is orthogonal to how much RAM an app is using. It just lets the CPU get cache hits more often, so it only relates to throughput.
That might technically offload work to the CPU, but that's work the CPU is actually good at. We want to offload that.
In the case of Electron apps, they use a lot of RAM and that's not to spare the CPU
> increasing cache locality is orthogonal to how much RAM an app is using. It just lets the CPU get cache hits more often, so it only relates to throughput.
Cache misses mean CPU stalls, which mean wasted CPU (i.e. the CPU accomplishes less than it could have in some amount of time).
> In the case of Electron apps, they use a lot of RAM and that's not to spare the CPU
The question isn't why apps use a lot of RAM, but what the effects of reducing it are. Reducing memory consumption by a little can be cheap, but if you want to do it by a lot, development and maintenance costs rise and/or CPU costs rise, and both are more expensive than RAM, even at inflated prices.
To see why reducing your RAM consumption by a lot costs CPU: if the program still has to work with the same data but holds much less of it in memory at once, it must recompute or reload that data more often, and that takes computational work.
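A minimal illustration of that tradeoff (a toy sketch, not from the thread): caching derived data spends RAM to save CPU, and dropping the cache means recomputing on every access.

```python
from functools import lru_cache

calls = {"n": 0}

def expensive_transform(x: int) -> int:
    """Stands in for any derived data you could either cache or recompute."""
    calls["n"] += 1
    return x * x  # imagine something genuinely costly

@lru_cache(maxsize=None)      # high RAM footprint, low CPU: compute once, keep forever
def cached(x: int) -> int:
    return expensive_transform(x)

def uncached(x: int) -> int:  # low RAM footprint, high CPU: recompute every time
    return expensive_transform(x)

for _ in range(1000):
    cached(7)
cached_calls = calls["n"]     # 1: computed once, then served from memory

for _ in range(1000):
    uncached(7)
print(calls["n"] - cached_calls)  # -> 1000: same answers, 1000x the compute
```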
But I agree that on consumer devices you tend to see software that uses a significant portion of RAM and a tiny portion of CPU and that's not a good balance, just as the opposite isn't. The reason is that CPU and RAM are related, and your machine is "spent" when one of them runs out. If a program consumes a lot of CPU, few other programs can run on the machine no matter how much free RAM it has, and if a program consumes a lot of RAM, few other programs can run no matter how much free CPU you have. So programs need to aim for some reasonable balance of the RAM and CPU they're using. Some are inefficient by using too little RAM (compared to the CPU they're using), and some are inefficient by using too little CPU (compared to the RAM they're using).
> Cache misses mean CPU stalls, which mean wasted CPU (i.e. the CPU accomplises less than it could have in some amount of time).
Yeah, I was saying CPU cache hits would result in better performance. The creator of Zig has argued that the easiest way to improve cache locality is by having smaller working sets of memory to begin with. No, it's not a given this will always work in every case. You can reduce working memory and not have better cache locality. But in a general sense, I understand why he argues for it.
> So programs need to aim for some reasonable balance of the RAM and CPU they're using
I agree with this, but
> but if you want to do it by a lot, development and maintenance costs rise and/or CPU costs rise, and both are more expensive than RAM, even at inflated prices
I would like you to clarify further, because saying CPU costs are more expensive than RAM costs is a bit misleading. A CPU might literally cost more than RAM, but a CPU is remarkably faster, and for work done, much cheaper and more efficient, especially with cache hits.
You had originally said
> It could be effective in some specific situations, but I would definitely not say that those situations are more common than the other ones
This is what I'm confused on. Why do you think most cases wouldn't benefit from this? Almost every app I've used is way on one end of the spectrum with regards to memory consumption vs CPU cycles. Don't you think there are actually a lot of cases where we could reduce memory usage AND increase cache locality, fitting more data into cache lines, avoiding GC pressure, avoiding paging and allocations, and the software would 100% be faster?
> But in a general sense, I understand why he argues for it.
Andrew is not wrong, but he's talking about optimisations with relatively little impact compared to others and is addressing people who already write software that's otherwise optimised. More concretely, keeping data packed tighter and reducing RAM footprint are not the same. The former does help CPU utilisation but doesn't make as big of an impact on the latter as things that are detrimental to the CPU (such as switching from moving collectors to malloc/free).
> Why do you think most cases wouldn't benefit from this?
The context "this" referred to was "Reducing your RAM consumption is not the best approach to reducing your RAM throughput is my point." For data-packing, Andy Kelley style, to reduce RAM bandwidth, the access patterns must be very regular, such as processing some large data structure in bulk (where prefetching helps). This is something you see in batch applications (such as compilers), but not in most programs, which are interactive. If your data access patterns are random, packing data more tightly will not significantly reduce your RAM bandwidth.
> Andrew is not wrong... and is addressing people who already write software that's otherwise optimised
I'm getting lost. What are we talking about if not that? Because if you're talking about unoptimized software, you can absolutely reduce RAM consumption without putting extra load on the CPU. Using a language that doesn't box every single value is going to reduce RAM consumption AND be easier on the CPU. Which is what most people are talking about on this post.
> The context to which "this" is referring to was "Reducing your RAM consumption is not the best approach to reducing your RAM throughput is my point."
I'm more interested in the original claim, which was
> Using a lot less RAM often implies using more CPU
There are a lot of apps using a lot of RAM, and it's not to save CPU. So where is "often" coming from here? I think there are WAY more apps that could stand to be debloated and would use less CPU.
It feels like you're coming at this from a JVM perspective. Yeah, tweaking my JVM to use less RAM would result in more CPU usage. But I don't think there's a single app out there as optimized as the JVM is. They use more RAM for other reasons.
> If your data access patterns are random, packing it more tightly will not significantly reduce your RAM bandwidth
Packing helps random access too. A smaller working set means more of your random accesses land in cache. Prefetching is one benefit of packing, but cache and TLB pressure reduction is the bigger one, and it applies regardless of access pattern
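A toy LRU-cache simulation (illustrative only; uniform random accesses and made-up sizes assumed) shows the working-set effect:

```python
import random

def lru_hit_rate(working_set: int, cache_slots: int, accesses: int = 200_000) -> float:
    """Simulate uniform random accesses against a fixed-size LRU cache."""
    random.seed(0)
    cache, hits = [], 0
    for _ in range(accesses):
        key = random.randrange(working_set)
        if key in cache:
            hits += 1
            cache.remove(key)
        cache.append(key)          # most-recently-used at the tail
        if len(cache) > cache_slots:
            cache.pop(0)           # evict the least-recently-used entry
    return hits / accesses

# With 64 cache slots, shrinking the working set raises the hit rate,
# even though every access is random.
print(lru_hit_rate(1000, 64) < lru_hit_rate(100, 64))  # -> True
```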
> Using a language that doesn't box every single value is going to reduce RAM consumption AND be easier on the CPU. Which is what most people are talking about on this post.
What popular language does that? I admit that rewriting the software in a different language could lead to better efficiencies on all fronts, but such massive work is hardly "an optimisation", and there are substantial costs involved.
But more importantly, I don't think it's right. Removing boxing can certainly have an impact on RAM footprint without an adverse effect on CPU, but I don't think it's a huge one. RAM footprint is dominated by what data is kept in memory and the language's memory management strategy (malloc/free vs non-moving tracing collectors vs moving collectors), and changing either one of these can very much have an adverse effect on CPU.
> There are a lot of apps using a lot of RAM, and it's not to save CPU. So where is "often" coming from here?
That the developers may not be conscious of the RAM/CPU tradeoff doesn't mean it's not there. Keeping less data in memory (and computing more of it on demand) can increase CPU utilisation as can switching from a language with a moving collector to one that relies on malloc/free.
> Packing helps random access too. A smaller working set means more of your random accesses land in cache.
Unless your entire live set fits in the cache, what matters much more is the temporal locality, not the size of the live set. If your cache size is 50MB, a program with a 1GB live set could have just as many or just as few cache misses as a program with a 100MB live set. In other words, you could reduce your live set by a factor of 10 and not see any improvement in your cache hit rate, and you can improve your cache hit rate without reducing your live set one iota.
For example, consider a server that caches some session data and evicts it after a while. Reducing the allowed session idle time can drastically reduce your live set, but it will barely have an effect on cache locality.
Tighter data layouts absolutely improve cache behaviour, but they don't have a huge effect on the footprint. Conversely, what data is stored in RAM and your memory management strategy have a large effect on footprint but don't help your cache behaviour much. In other words, Andy Kelley's emphasis on layout is very important for program speed, but it's largely orthogonal to RAM footprint.
I don't really disagree with most of what you're saying. What I took issue with is that you made it sound like software is a tradeoff between just RAM and CPU. What is clear is that it's a tradeoff between RAM, CPU, and abstractions (safe memory access, dev experience, etc.). My feeling, and the feeling of most people, is that dev experience has been so heavily prioritized that we now have abstractions upon abstractions upon abstractions, and software that did the same thing 20 years ago was somehow leaner than the software we have today. The narrow claim "within a fixed design, reducing RAM often costs CPU" is true.
> What popular language does that?
Other than C, Rust, Go, and Swift? C# can use value types, Java cannot, so much so that Project Valhalla has been highly anticipated for a long time. Obviously the JVM team thinks this is a gap and wants to address it. That is enough in itself to make someone consider a different language.
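Even outside those languages, the boxed-vs-unboxed footprint gap is easy to measure in CPython (a sketch; exact byte counts are CPython-specific): a `list` of ints holds pointers to boxed objects, while `array('q', ...)` stores raw 8-byte values inline.

```python
import sys
from array import array

n = 100_000
boxed = list(range(n))          # a list holds pointers to heap-allocated int objects
unboxed = array("q", range(n))  # contiguous signed 8-byte machine integers

# The boxed version pays for one pointer per element in the list, plus
# ~28 bytes per int object on CPython; the array stores values inline.
boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(i) for i in boxed)
unboxed_bytes = sys.getsizeof(unboxed)

print(boxed_bytes > 3 * unboxed_bytes)  # -> True on CPython
```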
> I admit that rewriting the software in a different language could lead to better efficiencies on all fronts, but such massive work is hardly "an optimisation", and there are substantial costs involved
That's a pivot to a totally different discussion, which is dev experience. We can say using a different language is not an optimization, I don't care to argue about that. But the fact is some languages have access to optimizations others do not. My dad has 8GB of RAM. I'm not going to install a JavaFX text editor on his computer and explain to him that "it's really quite good value for what the JVM has to do."
> Removing boxing can certainly have an impact on RAM footprint without an adverse effect on CPU, but I don't think it's a huge one
Removing boxing can improve layout, footprint, and CPU utilization simultaneously. That would lie outside the framework "You can't improve one without harming the other."
And it can be a huge effect. Saying it's always a big or small difference is like saying a stack of feathers can never be heavy. It depends on the use case. For a long-running server dominated by caches and session state, sure, although you're not hurting your performance to do it. For data heavy code? The difference between a HashMap<Long, Long> and an equivalent contiguous structure in C# is huge.
>> There are a lot of apps using a lot of RAM, and it's not to save CPU. So where is "often" coming from here?
> That the developers may not be conscious of the RAM/CPU tradeoff doesn't mean it's not there
I'm saying Electron uses a lot of RAM and it has nothing to do with offloading work from the CPU, and everything to do with taking the most brute force approach to cross app deployment that we possibly can. I'm not saying anything about the intentions of these developers.
> Unless your entire live set fits in the cache, what matters much more is the temporal locality, not the size of the live set. If your cache size is 50MB, a program with a 1GB live set could have just as many or just as few cache misses as a program with a 100MB live set. In other words, you could reduce your live set by a factor of 10 and not see any improvement in your cache hit rate, and you can improve your cache hit rate without reducing your live set one iota
That's all true. You are fitting more data into each cache line, but your access pattern can be random enough that it doesn't make a difference. It would technically reduce your RAM footprint, but as you say, not by much. I only brought this up as an example of something that could reduce RAM footprint without harming CPU utilization, not because it's a worthwhile optimization.
But one way to shrink the live set and improve cache behavior at the same time is to stop boxing everything.
Only if the software is optimised for either in the first place.
Ton of software out there where optimisation of both memory and cpu has been pushed to the side because development hours is more costly than a bit of extra resource usage.
The tradeoff has almost exclusively been development time vs resource efficiency. Very few devs are graced with enough time to optimize something to the point of dealing with theoretical tradeoff balances of near optimal implementations.
That's fine, but I was responding to a comment that said that RAM prices would put pressure to optimise footprint. Optimising footprint could often lead to wasting more CPU, even if your starting point was optimising for neither.
My response was that I disagree with this conclusion that something like "pressure to optimize RAM implies another hardware tradeoff" is the primary thing which will give, not that I'm changing the premise.
Pressure to optimize can more often imply just setting aside work to make the program be nearer to being limited by algorithmic bounds rather than doing what was quickest to implement and not caring about any of it. Having the same amount of time, replacing bloated abstractions with something more lightweight overall usually nets more memory gains than trying to tune something heavy to use less RAM at the expense of more CPU.
Some of the algorithms are built deep into the runtime. E.g. languages that rely on malloc/free allocators (which require maintaining free lists) are making a pretty significant tradeoff: wasting CPU to save on RAM, as opposed to languages using moving collectors.
GC burns far more CPU cycles. Meanwhile I'm not sure where you got this idea about the value of CPU cycles relative to RAM. Most tasks stall on IO. Those that don't typically stall on either memory bandwidth or latency. Meanwhile CPU bound tasks typically don't perform allocations and if forced avoid the heap like the plague.
Far less for moving collectors. That's why they're used: to reduce the overhead of malloc/free based memory management. The whole point of moving collectors is that they can make the CPU cost of memory management arbitrarily low, even lower than stack allocation. In practice it's more complicated, but the principle stands.
The reason some programs "avoid the heap like the plague" is because their memory management is CPU-inefficient (as in the case of malloc/free allocators).
> Meanwhile I'm not sure where you got this idea about the value of CPU cycles relative to RAM
There is a fundamental relationship between CPU and RAM. As we learn in basic complexity theory, the power of what can be computed depends on how much memory an algorithm can use. On the flip side, using memory and managing memory requires CPU.
To get the most basic intuition, let's look at an extreme example. Consider a machine with 1 GB of free RAM and two programs that compute the same thing and consume 100% CPU for their duration. One uses 80MB of RAM and runs for 100s; the other uses 800MB of RAM and runs for 99s (perhaps thanks to a moving collector). Which is more efficient? It may seem that we need to compare the value of 1% CPU reduction vs a 10x increase in RAM consumption, but that's not necessary. The second program is more efficient. Why? Because when a program consumes 100% of the CPU, no other program can make use of any RAM, and so both programs effectively capture all 1GB, only the second program captures it for one second less.
This scales even to cases when the CPU consumption is less than 100% CPU, as the important thing to realise is that the two resources are coupled. The thing that needs to be optimised isn't CPU and RAM separately, but the RAM/CPU ratio. A program can be less efficient by using too little RAM if using more RAM can reduce its CPU consumption to get the right ratio (e.g. by using a moving collector) and vice versa.
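The arithmetic in the example above can be written down as a toy "effective RAM-seconds" metric (my term, purely for illustration, with the numbers taken from the text):

```python
# Toy accounting for the example above: while a program pins the CPU at
# 100%, no other program can make use of the machine's free RAM, so the
# program effectively occupies ALL of it for its entire runtime.

MACHINE_RAM_GB = 1.0  # free RAM on the hypothetical machine

def effective_ram_seconds(ram_gb, runtime_s, cpu_fraction=1.0):
    # At 100% CPU the whole machine is captured; the program's nominal
    # footprint is irrelevant to what other programs could have used.
    if cpu_fraction >= 1.0:
        return MACHINE_RAM_GB * runtime_s
    return ram_gb * runtime_s

small_footprint = effective_ram_seconds(0.08, 100)  # 80 MB for 100 s -> 100.0
big_footprint   = effective_ram_seconds(0.80, 99)   # 800 MB for 99 s ->  99.0

# The nominally "wasteful" program captures the machine's RAM for less time.
assert big_footprint < small_footprint
```

The point of the sketch is only that the two resources are coupled: once the CPU is saturated, a smaller nominal footprint buys nothing.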
There are (at least) two glaring issues with your analysis. First, the vast majority of workloads don't block on CPU (as I previously pointed out) and when they do they almost never do heap allocations in the hot path (again, as I previously pointed out). Second, we don't use single core single thread machines these days. Most workloads block on IO or memory access; the CPU pipeline is out of order and we have SMT for precisely this reason.
Anyway I'm not at all inclined to blindly believe your claim that malloc/free is particularly expensive relative to various GC algorithms. At present I believe the opposite (that malloc/free is quite cheap) but I'm open to the possibility that I'm misinformed about that. You're going to need to link to reputable benchmarks if you expect me to accept the efficiency claim, but even then that wouldn't convince me that any extra CPU cycles were actually an issue for the reasons articulated in the preceding paragraph.
> There are (at least) two glaring issues with your analysis. First, the vast majority of workloads don't block on CPU (as I previously pointed out) and when they do they almost never do heap allocations in the hot path (again, as I previously pointed out). Second, we don't use single core single thread machines these days. Most workloads block on IO or memory access; the CPU pipeline is out of order and we have SMT for precisely this reason.
This doesn't matter because if you're running a single program on a machine, it might as well use all the CPU and all the RAM. As long as you're under 100% on both, you're good. But we want to utilise the hardware well because we typically want to run multiple programs (or VMs) on a single machine, and the machine is exhausted when the first of CPU or RAM is exhausted. So the question is how should your CPU and RAM usage be balanced to offer optimal utilisation given that the machine is spent when the first of CPU and RAM is spent. E.g. you can only run two programs, each using 50% of CPU; if they each use only 5% of RAM, you've saved nothing as no third program can run. So if you spend either one of these resources in an unbalanced way, you're not using your hardware optimally. Using 2% more CPU to save 200MB of RAM could be suboptimal.
I'm not saying that for every program that uses X% CPU should also use exactly X% of RAM or it must be wasting one or the other, but that's the general perspective of how to think about efficiency. Using a lot of one and little of the other is, broadly speaking, not very efficient.
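A toy packing calculation makes the "machine is exhausted when the first resource is exhausted" point concrete:

```python
# Sketch of the packing argument: a machine is full when the FIRST of CPU
# or RAM runs out, so lopsided programs strand the other resource.

def utilisation_when_full(cpu_frac, ram_frac):
    # Pack identical programs until the first resource is exhausted, then
    # report (cpu_used, ram_used) as fractions of the machine.
    n = int(min(1 / cpu_frac, 1 / ram_frac))
    return n * cpu_frac, n * ram_frac

# CPU-hungry, RAM-light (the example from the text): two copies exhaust the
# CPU while 90% of the RAM sits idle and no third program can run.
assert utilisation_when_full(0.50, 0.05) == (1.0, 0.1)

# Balanced programs use the hardware fully.
assert utilisation_when_full(0.50, 0.50) == (1.0, 1.0)
```

Real scheduling is messier, of course, but this is the sense in which using a lot of one resource and little of the other wastes the machine.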
> Anyway I'm not at all inclined to blindly believe your claim that malloc/free is particularly expensive relative to various GC algorithms. At present I believe the opposite (that malloc/free is quite cheap) but I'm open to the possibility that I'm misinformed about that.
You are.
> You're going to need to link to reputable benchmarks if you expect me to accept the efficiency claim, but even then that wouldn't convince me that any extra CPU cycles were actually an issue for the reasons articulated in the preceding paragraph.
I don't believe there are any reputable benchmarks of full applications (which is where memory-management matters) that are apples-to-apples. I'm speaking from over two decades of experience with C++ and Java.
The important property of moving collectors is that they give you a knob that allows you to turn RAM into CPU and vice-versa (to some extent), and that's what you want to achieve the efficient balance.
Moving collectors as generally used are a huge waste of memory throughput, and this shows up consistently in the performance measurements. Moving data is very expensive! The whole point of ownership tracking in programming languages is so that large chunks of "owned" data can just stay put until freed, and only the owning handle (which is tiny) needs to move around. Most GC programming languages do a terrible job of supporting that pattern.
That's just not true. To give you a few pieces of the picture, moving collectors move little memory and do so rarely (relative to the allocation rate):
In the young generation, few objects survive and so few are moved (the very few that survive longer are moved into the old gen); in the old generation, most objects survive, but the allocation rate is so low that moving them is rare (although the memory management technique in the old gen doesn't matter as much precisely because the allocation rate is so low, so whether you want a moving algorithm or not in the old gen is less about speed and more about other concerns).
On top of that, the general principle of moving collectors (and why in theory they're cheaper than stack allocation) is that the cost of the overall work of moving memory is roughly constant for a specific workload, but its frequency can be made as low as you want by using more RAM.
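That principle can be sketched with a back-of-envelope model, under the usual textbook assumptions for a copying collector (each collection copies only the live set, and a collection is needed once new allocations fill the free space); this is an illustration, not a claim about any particular JVM:

```python
# Back-of-envelope model of a copying (moving) collector: the work per
# collection is proportional to the live set, and collections happen once
# allocation has filled the heap's free space.

def gc_cost_per_mb_allocated(live_mb, heap_mb, copy_cost_per_mb=1.0):
    free_mb = heap_mb - live_mb
    # MB copied per collection, amortised over the MB allocated between
    # collections. Growing the heap shrinks this toward zero.
    return (live_mb * copy_cost_per_mb) / free_mb

# Same program, bigger heap: the same copying work amortises over far more
# allocation, so the CPU cost per allocated MB falls.
small_heap = gc_cost_per_mb_allocated(live_mb=100, heap_mb=200)   # 1.0
big_heap   = gc_cost_per_mb_allocated(live_mb=100, heap_mb=1100)  # 0.1
assert big_heap < small_heap
```

This is the RAM-for-CPU knob: spend 5x the headroom, pay a tenth of the management cost per allocation.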
The reason moving collectors are used in the first place is to reduce the high overhead of malloc/free allocators.
Anyway, the general point I was making above is that a machine is exhausted not when both CPU and RAM are exhausted, but when one of them is. Efficient hardware utilisation is when the program strikes some good balance between them. There's not much point to reducing RAM footprint when CPU utilisation is high or reducing CPU consumption when RAM consumption is high. Using much of one and little of the other is wasteful when you can reduce the higher one by increasing the other. Moving collectors give you a convenient knob to do that: if a program consumes a lot of CPU and little RAM, you can increase the heap and turn some RAM into CPU and vice versa.
It's magical until you start going through the code carefully, line by line, and find yourself typing at the agent: YOU DID WHAT NOW? Then, when you read a few more lines and realise that neither AI nor human will be able to debug the codebase once ten more features are added you find yourself typing: REVERT. EVERYTHING.
yes, this is an issue i see too... also fixing it up takes a lot of time (sometimes more than if i'd just 'one-shotted' it myself)... idk, these tools are useful, but i feel like we are going too far with 'just let the ai do everything'...
Does Yegge really think that building production software this way is a good idea?
Let's assume that managing context well is a problem and that this kind of orchestration solves it. But I see another problem with agents:
When designing a system or a component we have ideas that form invariants. Sometimes the invariant is big, like a certain grand architecture, and sometimes it's small, like the selection of a data structure. Eventually, though, you want to add a feature that clashes with that invariant. At that point there are usually three choices:
* Don't add the feature. The invariant is a useful simplifying principle and it's more important than the feature; it will pay dividends in other ways.
* Add the feature inelegantly or inefficiently on top of the invariant. Hey, not every feature has to be elegant or efficient.
* Go back and change the invariant. You've just learnt something new that you hadn't considered and puts things in a new light, and it turns out there's a better approach.
Often, only one of these is right. Usually, one of these is very, very wrong, and with bad consequences.
But picking among them isn't a matter of context. It's a matter of judgment, and the models - not the harnesses - get this judgment wrong far too often (they go with what they know - the "average" of their training - or they just don't get it). So often, in fact, that mistakes quickly accumulate and compound, and after a few bad decisions like this the codebase is unsalvageable. Today's models are just not good enough (yet) to create a complete sustainable product on their own. You just can't trust them to make wise decisions. Study after study and experiment after experiment show this.
Now, perhaps we make better judgment calls because we have context that the agent doesn't. But we can't really dump everything we know, from facts to lessons, and that pertains to every abstraction layer of the software, into documents. Even if we could, today's models couldn't handle them. So even if it is a matter of context, it is not something that can be solved with better context management. Having an audit trail is nice, but not if it's a trail of one bad decision after another.
I think a lot of it comes down to the training objective, which is to fulfill the user’s request.
They have knowledge about how programs can be structured in ways that improve overall maintainability, but little room to exercise that knowledge over the course of fulfilling the user’s request to add X feature.
They can make changes which lead to an improvement to the code base itself (without adding features); they just need to be asked explicitly to do so.
I’d argue the training objective should be tweaked. Before implementing, stop to consider the absolutely best way to approach it - potentially making other refactors to accommodate the feature first.
> A messy codebase is still cheaper to send ten agents through than to staff a team around
People who say that haven't used today's agents enough or haven't looked closely at what they produce. The code they write isn't messy at all. It's more like asking the agent to build a building from floorplans and spec, and it produces everything in the right measurements and right colours and passes all tests. Except then you find out that the walls and beams are made of foam and the art is load-bearing. The entire construction is just wrong, hidden behind a nice exterior. And when you need to add a couple more floors, the agents can't "get through it" and neither can people. The codebase is bricked.
Today's agents are simply not capable enough - without very close and labour-intensive human supervision - to produce code that can last through evolution over any substantial period of time.
They can work really well if you put sufficient upfront engineering into your architecture and its guardrails, such that agents (and humans alike) basically can't produce incorrect code in the codebase. If you just let them rip without that, then they require very heavy baby-sitting. With that, they're a serious force-multiplier.
They just make a lot of mistakes that compound and they don't identify. They currently need to be very closely supervised if you want the codebase to continue to evolve for any significant amount of time. They do work well when you detect their mistakes and tell them to revert.
Debugging would suffer as well, I assume. There's this old adage that if you write the cleverest code you can, you won't be clever enough to debug it.
There's nothing really stopping agents from writing the cleverest code they can. So my question is, when production goes down, who's debugging it? You don't have 10 days.
Oh my god, have Anthropic products been absolutely saying everything is load-bearing for the last week or so. Literally every other paragraph has "such-and-such is load-bearing".
The problem is, the MBAs running the ship are convinced AI will solve all that with more datacenters. The fact that they talk about gigawatts of compute tells you how delusional they are. Further, the collateral damage this delusion will cause as these models sigmoid their way into agents, harnesses, expert models, fine-tuned derivatives, and cascading manifolds of intelligent word-salad exercises shouldn't be underestimated.
A lot of that can be overcome by including the need to be able to put more floors on top as part of the spec. Whether it be humans or agents, people rarely specify that one explicitly but treat it as an assumed bit of knowledge.
It goes the other way quite often with people. How often do you see K8s for small projects?
> A lot of that can be overcome by including the need to be able to put more floors on top as part of the spec
I wish it could, but in practice, today's agents just can't do that. About once a week I reach some architectural bifurcation where one path is stable and the other leads to an inevitable total-loss catastrophe from which the codebase will not recover. The agent's success rate (I mostly use Codex with gpt5.4) is about 50-50. No matter what you explain to them, they just make catastrophic mistakes far too often.
First, it's not "can occur" but does occur 100% of the time. Second, sure, it does mean something is missing, but how do you test for "this codebase can withstand at least two years of evolution"?
You can spend a lot of time perfecting the test suite to meet your specific requirements and needs, but I think that would take quite a while, and at that point, why not just write the code yourself? I think the most viable approach of today's AI is still to let it code and steer it when it makes a decision you don't like, as it goes along.
You have to fight to get agents to write tests, in my experience. It can be done, but by default they don't. I've yet to figure out how to get any agent to use TDD - that is, write a test and then verify it fails. Once in a while I can get it to write one test that way, but it then writes far more code to make it pass than the test justifies, and so is still missing coverage of important edge cases.
I have TDD flow working as a part of my tasks structuring and then task completion.
There are separate tasks for making the tests and for implementing. The agent which implements is told to pick up only the first available task, which will be “write tests task”, it reliably does so. I just needed to add how it should mark tests as skipped because it’s been conflicting with quality gates.
It isn't. Anthropic tried building a fairly simple piece of software (a C compiler) with a full spec, thousands of human-written tests, and a reference implementation - all of which were made available to the agent, and which the model was trained on. It's hard to imagine a better-tested, better-specified project, and we're talking about 20KLOC. Their agents worked for two weeks and produced a 100KLOC codebase that was unsalvageable - any fix to one thing broke another [1]. Again, the attempt was to write software that is smaller, better tested, and better specified than virtually any piece of real software, and the agents still failed.
Today's agents are simply not capable enough to write evolvable software without close supervision to save them from the catastrophic mistakes they make on their own with alarming frequency.
Specifically, if you look at agent-generated code, it is typically highly defensive, even against bugs in its own code. It establishes an invariant and then writes a contingency in case the invariant doesn't hold. I once asked it to maintain some data structure so that it could avoid a costly loop. It did, but in the same round it added a contingency (that uses the expensive loop) in the code that consumes the data structure in case it maintained it incorrectly.
This makes it very hard for both humans and the agent to find later bugs and know what the invariants are. How do you test for that? You may think you can spec against that, but you can't, because these are code-level invariants, not behavioural invariants. The best you can do is ask the agent to document every code-level invariant it establishes and rely on it. That can work for a while, but after some time there's just too much, and the agent starts ignoring the instructions.
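A hypothetical sketch of the pattern described above (invented names, but the shape is exactly what the comment reports): the agent maintains an index so lookups avoid a full scan, then, in the same change, adds a fallback that does the full scan anyway "in case" its own invariant broke:

```python
# Hypothetical illustration of agent-style defensive code: an invariant is
# established, and a contingency is written against the invariant failing.

class UserStore:
    def __init__(self):
        self.users = []
        self.by_email = {}          # invariant: always mirrors self.users

    def add(self, user):
        self.users.append(user)
        self.by_email[user["email"]] = user   # invariant maintained here

    def find(self, email):
        hit = self.by_email.get(email)
        if hit is not None:
            return hit
        # Defensive contingency: if the index were ever wrong, fall back to
        # the O(n) scan the index was built to eliminate. Now neither a
        # reader nor the agent can tell whether the invariant is actually
        # relied upon, or whether code that reaches this line is buggy.
        return next((u for u in self.users if u["email"] == email), None)

store = UserStore()
store.add({"email": "a@example.com", "name": "Ada"})
assert store.find("a@example.com")["name"] == "Ada"
```

The code passes every behavioural test, which is precisely why a spec can't catch it: the problem is a code-level invariant made undiscoverable, not a wrong output.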
I think that people who believe that agents produce fine-but-messy code without close supervision either don't carefully review the code or abandon the project before it collapses. There's no way people who use agents a lot and supervise them closely believe they can just work on their own.
"Incomplete specs" is the way of the world. Even highly engineered projects like buildings have "incomplete specs" because the world is unpredictable and you simply cannot anticipate everything that might come up.
And sometimes it can't even handle it then. I was recently porting ruby web code to python. Agents were simultaneously surprisingly good (converting ActiveRecord to sqlalchemy ORM) and shockingly, incapably bad.
For example, ruby uses blocks a lot. Ruby blocks are curious little thingies because they are arguably just syntax sugar for a HOF, but man it's great syntax sugar. Python then has "yield" which is simultaneously the same keyword ruby uses for blocks, but works fundamentally differently (instead of just a HOF, it's for generating an iterator/generator) and while there are some decorators that can use yield's ability to "pause" execution in the function to send control flow back out of the function for a moment (@contextmanager) which feels _even more_ like ruby blocks, it's a rather limited trick and requires the decorator to adapt the Generator to a context manager and there's just no good way to generalize that.
Somehow this is the perfect storm to make LLMs completely incapable of converting ruby code that uses blocks for more than the basic iteration used in the stdlib. It will try to port to python code that is either nonsensical, or uses yield incorrectly and doesn't actually work (and in a way that type checkers can even spot). And furthermore, even if you can technically whack it with a hammer until it works with yield, it's often not at all the way to do it. Ruby devs use blocks not-uncommonly while python devs are not really going to be using yield often at all, perhaps outside of @contextmanager. So the right move is usually to just restructure control flow to not need to use blocks/HOFs (or double down and explicitly pass in a function). (Rubyists will cringe at this, and rightly so... Ruby is often extraordinarily expressive).
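To make the mismatch concrete, here is a sketch of the Python side (illustrative names, nothing from the actual port): `@contextmanager` covers exactly one Ruby-block shape, "do something around a body", and only with a single `yield`; anything more general has to be restructured into an explicit HOF or a loop:

```python
from contextlib import contextmanager

# @contextmanager makes a generator feel Ruby-block-ish, but only for the
# "wrap a body" pattern, and only with exactly one yield.
@contextmanager
def logged(name):
    print(f"enter {name}")
    try:
        yield name            # control passes to the with-body exactly once
    finally:
        print(f"exit {name}")

with logged("task") as n:
    print(f"inside {n}")

# A Ruby method, by contrast, can invoke its block zero, one, or many times
# (retry loops, each-style iteration). The closest general Python
# translation is an explicit higher-order function:
def retry(times, body):
    for attempt in range(times):
        try:
            return body()
        except Exception:
            if attempt == times - 1:
                raise
```

Which is exactly the restructuring the comment describes: the idiomatic move is usually to abandon the block shape entirely rather than contort `yield` into it.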
The fact that such a simple language feature trips them up so completely is pretty odd to me. I guess maybe their training data doesn't include a lot of ruby-to-python conversions. Maybe that's indicative of something, but I digress.
Lol I largely agree with my beloved dissenters, just not on the same magnitude. I understand complete specs are impossible and equivalent to source code via declaration. My disagreement is with this particular part:
"t's more like asking the agent to build a building from floorplans and spec, and it produces everything in the right measurements and right colours and passes all tests. Except then you find out that the walls and beams are made of foam and the art is load-bearing. "
If your test/design of a BUILDING doesn't include at least simulations/approximations of such easy-to-catch structural flaws, it's just bad engineering. Which rhymes a lot with the people that hate AI: by and large, they just don't use it well.
I'll grant you that Go is extremely opinionated; that's its shtick. But it's an old language that started out with a 1970s design as a statement by its creators against modern programming languages. From its language design, through its compiler, to its GC algorithm, it is intentionally retro (Java retired its Go-like GC five years ago because the algorithm was too antiquated). It may suit your taste and I'm not suggesting that it's bad, but modern it is not.
These are all the options that have ever existed, including options that are or were available only in debug builds used during development and diagnostic options. There are still a few hundred non-diagnostic "product" flags at any one time, but most are intentionally undocumented (the list is compiled from the source code [1]) and are similar in spirit to compiler/linker configuration flags (only in Java, compilation and linking are done at runtime) and they're mostly concerned with various resource constants. It is very rare for most of them to ever be set manually, but if there's some unusual environment or condition, they can be helpful.