Counter to what I (and many others) assumed, for long-form agent coding tasks models are not as easily hot-swappable as I thought.
I have developed decent intuition on what kinds of problems Codex, Claude, Cursor (& sub-variants), Composer etc. will or will not be able to do well across different axes of speed, correctness, architectural taste, ...
If I had to reflect on why I still don't use Gemini, it's because they were late to the party and I would now have to be intentional about spending time learning yet another set of intuitions about those models.
I feel like "prompting language" doesn't translate over perfectly either. It's like we become experts at operating a particular AI agent.
I've been experimenting with small local models, and the prompts you use with those are very different from the ones you use with Claude Code. The differences between Claude, Codex, and Gemini seem smaller, but they are there.
It's hard to articulate those differences, but I kind of get into a groove after using a model for a while.
This was a common argument against LLMs, that the space of possible next tokens is so vast that eventually a long enough sequence will necessarily decay into nonsense, or at least that compounding error will have the same effect.
Problem is, that's not what we've observed to happen as these models get better. In reality there is some metaphysical coarse-grained substrate of physics/semantics/whatever[1] which these models can apparently construct for themselves in pursuit of ~whatever~ goal they're after.
The initially stated position, and your position ("trying to hallucinate an entire world is a dead end"), is a sort of maximally pessimistic "the universe is maximally irreducible" claim.
And going back a little further, it was thought that backpropagation would be impractical, and trying to train neural networks was a dead end. Then people tried it and it worked just fine.
> Problem is, that's not what we've observed to happen as these models get better
Eh? Context rot is extremely well known. The longer you let the context grow, the worse LLMs perform. Many coding agents will pre-emptively compact the context or force you to start a new session altogether because of this. For Genie to create a consistent world, it needs to maintain context of everything, forever. No matter how good it gets, there will always be a limit. This is not a problem if you use a game engine and code it up instead.
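To make the compaction point concrete, here's a minimal sketch of what pre-emptive compaction looks like (Python, with a hypothetical summarize() callback and made-up limits, not any particular agent's implementation):

    MAX_TOKENS = 200_000                 # illustrative context window
    COMPACT_AT = int(0.8 * MAX_TOKENS)   # compact well before the hard limit

    def count_tokens(messages):
        # crude stand-in: roughly 4 characters per token
        return sum(len(m["content"]) for m in messages) // 4

    def maybe_compact(messages, summarize):
        # once the conversation gets too long, replace older turns with a summary
        if count_tokens(messages) < COMPACT_AT:
            return messages
        head, tail = messages[:-10], messages[-10:]   # keep the recent turns verbatim
        summary = summarize(head)                     # lossy: detail is discarded here
        return [{"role": "system", "content": "Summary of earlier work: " + summary}] + tail

That summarize step is inherently lossy: every compaction throws away detail the model may need later, which is exactly why a hallucinated world built on a growing context drifts.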
We're talking about context here though. The first couple seconds of Genie are great, but over time it degrades. It will always degrade, because it's hallucinating a world and needs to keep track of too many things.
That has traditionally been the problem with these types of models, but Genie is supposed to maintain coherence up to 60 seconds.
I've tried using it a couple of times, but can't get in. It is either down or hopelessly underprovisioned by Google. Do you have any links to videos showing that the quality degrades after only a few seconds?
Edit: no, it just doesn't work in Firefox. It works incredibly well, at least in Chrome, and it does not lose coherence to any great extent. The controls are terrible, though.
People could trivially switch their search engine to Bing or Yahoo, but they don't.
If ads are so overpriced, how big is your short position on google? Also ads are extremely inefficient in terms of conversion. Ads rendered by an intelligent, personalized system will be OOM more efficient, negating most of the "overvalue".
I'm not saying they should serve ads. It's a terrible strategy for other reasons.
Funny that you mention Yahoo, as in my mind they're the perfect example of what the poster above you noted: people quickly switched to Google once a better alternative to Yahoo appeared.
> People could trivially switch their search engine to Bing or Yahoo, but they don't.
Well those are obviously worse products.
> If ads are so overpriced, how big is your short position on google?
I hate hearing this stupid, stupid line.
Most companies are run by neanderthals with more money than brains. Companies burn money on advertising because why not? Making your product better is hard and takes time; advertising is the easiest thing you can do. Does it work? Not really, no, but you get extra business for as close to zero effort as you can possibly get. Hit a wall? Just advertise more!
> Ads rendered by an intelligent, personalized system will be OOM more efficient, negating most of the "overvalue".
This is exactly what people said about personalized ads. "No you don't understand! It's not like a billboard!"
And that's true, but consumers are not fucking braindead, and there's also the laws of economics. If I have 50 bucks, I'm not spending 20 fucking dollars on your dumbass paint, no matter how much you advertise it. And that's not a me thing, that's a consumer thing. You can spend 1 quadrillion dollars advertising Ferraris and guess what - you will STILL quickly saturate that market and hit a hard ceiling. Because consumers can't afford it.
And that's not even touching on the fact that most of the metrics around advertisements are just obviously bullshit. How many human eyeballs are actually on ads? Much, much less than everyone thinks.
Yes, sure, we can build highly personalized ads. Whatever. But at the end of the day, consumers still have the exact same amount of disposable income as before. We have created Z E R O value, what we have done is consolidated it.
Hmm, what happens when markets consolidate too much? Well, I guess that would mean advertising becomes completely worthless, wouldn't it? What a conundrum! It's a good thing our markets haven't been consolidating for the past 70 years...
Do you think consumer brands lose money when they pay Google to do advertising? Do you think digital ads have a negative ROI for the brands that buy them? If so, why do they keep buying more? Wouldn’t they lose to more efficient companies?
I think you underestimate how valuable being the top slot on Google is. Just the other day I googled “bluetooth speaker” and bought the first result (an ad). One hour of that can net you millions of dollars. That’s why consumer brands bid more and more every year on digital advertising.
> Do you think consumer brands lose money when they pay Google to do advertising?
For many brands, yes, and they don't know it.
> I think you underestimate how valuable being the top slot on google is.
The more you advertise, the less valuable each ad space becomes. Consumers have only so much money to dole out. Giving them more ads won't increase that pot of money - it will just make your cut smaller and smaller as it's split across more brands.
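Purely illustrative arithmetic for the "smaller cut" claim, with made-up numbers:

    # a fixed pool of consumer spending split across more and more advertisers
    consumer_spend = 1_000_000           # fixed pool of disposable income (made up)
    for num_brands in (2, 10, 50):
        per_brand = consumer_spend / num_brands
        print(num_brands, "brands bidding ->", per_brand, "per brand")
    # 2 -> 500000.0, 10 -> 100000.0, 50 -> 20000.0 per brand

This of course assumes the pool is fixed, which is the crux of the disagreement.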
Which brands lose money on ads? Why are they still in business?
> consumers have a lot of money that they dole out. More ads won't increase the cut of money
Consumer spending is not a fixed pie chart or a zero sum game. US consumer spending has grown from $14 to $19 trillion since 2020. $5 trillion in new pie!!
Your model of ads is: “I, a consumer, have decided to buy a bluetooth speaker, and the ads push and pull me towards particular brands”. But that’s not how ads work! Ads don’t just compete for fixed spending, they induce NEW spending. An ad can give a customer the idea of buying, and grow the market.
> US consumer spending has grown from $14 to $19 trillion since 2020. $5 trillion in new pie!!
All that's telling you is the economy is not doing nearly as well as some of our metrics would have you believe.
Real wages are about the same as before, probably lower. Consumers are buying the same amount of stuff - no value has been created. Rather, the dollar has been devalued, much more than we're willing to let on.
There's real value, like actual physical goods, services and labor, and there's fake value. Fake value tries to proxy real value, but historically it's often way off.
Money is fake value. Stocks are even more fake value. It doesn't matter if your stock price is through the roof if you're not selling a product people want, for example. The product is the value, the stock price is people trying to approximate the value and future value.
Look it's fine to have contrarian opinions that left is right, everything is backwards, whatever. But when it comes to business and money, these things are quantitative and falsifiable. If you have a better understanding than the idiots in charge, then go be rich! If you have a better model for real value, you'll outcompete them. Until you do that, you are playing word games, ones which have somehow deluded you into believing that the most profitable company on earth is not valuable.
It's not even contrarian, it's just true. Money has always been a proxy for real value, which we created because real value can be hard to measure.
> If you have a better understanding than the idiots in charge, then go be rich!
Doesn't work this way because most markets are dumb as rocks.
> If you have a better model for real value, you'll outcompete them.
Doesn't work this way because most markets are dumb as rocks.
Look, after a certain point you have to detach from what you're being told and look at the world around you.
Prime example: tobacco. For humanity, tobacco has a negative value. You should be getting paid to smoke. Why? Because it kills you, and that's very expensive.
But that's hard to measure, right? So we just sell the cigarettes and say their value is what they're sold for. But that's not their actual value.
Their actual value, in the real world, in your hands and in your lungs, is negative. That's not an opinion. That's objective. That's just what it is.
When you look around our markets, almost all products are like this to some degree. The value we're creating is not necessarily real value.
Ads are another prime example. Do they enrich the world? Do they help consumers? No. They have zero real value. They just move money around via manipulation. That's not my opinion. That's just the objective reality.
Eventually, the real world catches up to la la land. You can't just say "well do ads and you make money". When there's no more money to move around, then even our fake value estimates of ads approach zero.
Not at all. Vanishingly few things during the development process of a novel thing have truly objective measures. The world is far too complex. We all act and exist primarily in a probabilistic environment. A subjective evaluation is not so different from simply making a prediction about how something will turn out. If your predictions based on subjective measures turn out to be more correct than others', your subjectivity is objectively better.
Hence the author's main point: good taste is taste that fits the needs of the project. If you can't align your own presuppositions with the actualities of the work you're doing, then obviously your subjective measures going forward will not be very good.
Thanks for the specifics, really fascinating list! I'm sure I'm being a bit flippant, but it's pretty funny that a list including the Playstation 1, N64, and Apple Watches is in the same conversation as systems that need to compile git from source.
Anyone know of anything on that list with more than a thousand SWE-coded users? Presumably there's at least one or two for those in the know?
What I like about seeing a project support a long list of totally irrelevant old obscure platforms (like Free Pascal does, and probably GCC) is that it gives some hope that they will support some future obscure platform that I may care about. It shows a sign of good engineering culture. If a project supports only 64-bit arm+x86 on the three currently most popular operating systems that is a red flag for future compatibility risks.
The problem is that "support" usually isn't quite the right word. In practice for obscure platforms it is often closer to "isn't known to be horribly broken". Rust at least states this explicitly with their Tier 1/2/3 system, but the same will apply to every project.
Platform support needs to be maintained. There is no way around that. Any change in the codebase has the possibility of introducing subtle platform-specific bugs. When platform support means that some teenager a decade ago got it to compile during the summer holiday and upstreamed her patches, that's not worth a lot. Proper platform support means having people actively contributing to the codebase, regularly running test suites, and making sure that the project stays functional on that platform.
On top of this, it's important to remember that platform support isn't free either. Those platform-specific patches and workarounds can and will hold back development for all the other platforms. And if a platform doesn't have a maintainer willing to contribute to keeping those up-to-date, it probably also doesn't have a developer who's doing the basic testing and bug fixing, so its support is broken anyways.
In the end, is it really such a big deal to scrap support for something which is already broken and unlikely to ever be fixed? At a certain point you're just lying to yourself about the platform being supported - isn't it better to accept reality and formally deprecate it?
In theory I agree with you, and code written in a platform-agnostic way is definitely something we should strive for, but in practice: can keeping broken code around really be called "good engineering culture"?
I don't think the concern is whether a user can compile git from source on said platform, but rather whether the Rust standard library is well supported on said platform, which is required for cross-compiling.
There is no evidence to indicate this is the case. To the contrary, all evidence we have points to these models, over time, being able to perform a wider range of tasks at a higher rate of success. Whether it's GPQA, ARC-AGI or tool usage.
> they are delegating to other approaches
> Faking intelligence is not intelligence. It's just text generation.
It seems like you know something about what intelligence actually is that you're not sharing. If it walks, talks and quacks like a duck, I have to assume it's a duck[1]. Though, maybe it quacks a bit weird.
> There is no evidence to indicate this is the case
Burden of proof is on those trying to convince us to buy into the idea of LLMs as being "intelligence".
There is no evidence of the Flying Spaghetti Monster or Zeus or God not existing either, but we don't take seriously the people who claim they do exist (and there isn't proof, because these concepts are made up).
Why should we take seriously the folks claiming LLMs are intelligence without proof (there can't be proof, of course, because LLMs are not intelligence)?
This question becomes difficult whenever a system becomes sufficiently complex. Take any chaotic system, like a double pendulum, and press play at step 100,000. You ask, 'what is it doing?' Well, it's just applying its rule, step to step.
Zoom out and look at its trajectory over those 100,000 steps and ask again.
The answer is something alien. Probabilistically, it is certain that the description of its behavior is not going to exist in a space we as humans can understand. Maybe if we were god beings we could say 'No no, you see, the behavior of the double pendulum isn't seemingly random, you just have to look at it like this'. Encryption is a decent analogy here.
We're fooled into thinking we can understand these systems because we forced them to speak English. Under the hood is a different story.
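The step-rule vs. long-trajectory point is easy to demonstrate with a much simpler chaotic system than a double pendulum; here's a minimal sketch using the logistic map instead (chosen only because it fits in a few lines):

    # two runs of the logistic map x -> r*x*(1-x), a textbook chaotic system,
    # started a hair apart; each step is a trivial rule, yet the long-run
    # trajectories almost certainly end up far apart
    def run(x, r=3.9, steps=100):
        for _ in range(steps):
            x = r * x * (1 - x)
        return x

    print(run(0.500000))
    print(run(0.500001))   # a 1e-6 difference at the start, a very different end state

Each individual step is perfectly understandable; "what is it doing?" over the whole trajectory is not.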
I would note there are some known health hazards in handling thermal-paper receipts (BPA/BPS) [1] with your bare hands if you do so often. I don't know much beyond this; I would look into it.
Yes, safety of thermal paper is the first issue that comes to mind.
Secondly, IME thermal print can fade to nothing after 1-10 years. So these are specifically for short-ish-term use. Not for labeling something that is supposed to last a long time.
It's come up every time something related to thermal printing has been mentioned on HN lately, but this is honestly great stuff if you're in Germany: https://www.oekobon.de/
These non-poisonous blue receipts have the added benefit of being markable with a fingernail, which is nifty if you're using them to print your shopping list: crossing things off is very satisfying.
You can also use dot-matrix / impact receipt printers; they work the same way, just with an ink ribbon, so no special paper is needed.
They are used in kitchens, where thermal paper obviously won't work. Another advantage is that they can usually print two colours, black and red. And the sound is rather satisfying :-)
Right, bisphenols. And despite some BPA-free options, there are many alerts about the risks of the replacements.
Maybe it's time for a cool old-style dot-matrix receipt printer using regular paper?
imo it's a mistake to interpret the marginal increases in the upper echelons of benchmarks as materially marginal gains. Chess is an example: Elo narrows heavily at the top, but each Elo point carries more relative weight. This is a bit apples and oranges since chess is adversarial, but I think the point stands.
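For the Elo half of that, a hedged illustration: treat the rating pool as roughly normal (the parameters below are invented, not real ratings data) and look at how much of the pool sits above you.

    # fraction of a hypothetical, normally distributed rating pool above a given rating;
    # near the mean, 100 points changes the picture modestly, while near the top even
    # 50 points sharply thins the set of players still above you
    from statistics import NormalDist

    pool = NormalDist(mu=1500, sigma=300)   # made-up parameters

    for rating in (1500, 1600, 2700, 2750):
        above = 1 - pool.cdf(rating)
        print(f"above {rating}: {above:.2e} of the pool")

Under those made-up parameters, going from 2700 to 2750 roughly halves the number of players above you, while going from 1500 to 1600 only trims it from about 50% to 37%.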
What do you mean by this? I'm assuming you're not speaking about simple absolute differences in value - there have been top players rated over 100 points higher than the average of the rest of the top ten.
The complaint about his ego is warranted, but he also earned it. Wolfram earned his PhD in particle physics from Caltech at 21 years old. Feynman was on his thesis committee. He spent time at the IAS. When he speaks about something, no matter in which configuration he chooses to do so, I am highly inclined to listen.