Hacker News | Eridrus's comments

The article is just helpfully illustrating how artisanal you can make your slop if you really try!

How are people building anything without evals?

Maybe I spent too much time in the ML mines, but it is somewhat inconceivable to me to iterate on a tricky problem without an eval set.
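To make the point concrete, an "eval set" at its most minimal is just a fixed list of labeled cases and a scoring loop you re-run after every change. A sketch (the cases, the stub model, and the exact-match metric are all placeholders; real eval sets are larger and task-specific):

```python
# Minimal eval-loop sketch. `model` is any callable you supply that maps
# an input string to an output string; the cases below are toy examples.

EVAL_SET = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def run_eval(model, eval_set):
    """Score a model over a fixed eval set and return accuracy."""
    correct = sum(
        1 for case in eval_set
        if model(case["input"]).strip() == case["expected"]
    )
    return correct / len(eval_set)

# Usage: tweak the prompt/model, re-run, compare scores against baseline.
baseline = run_eval(lambda q: "4" if "+" in q else "Paris", EVAL_SET)
print(f"accuracy: {baseline:.0%}")
```

The point is that without a loop like this, "did my change help?" is vibes; with it, iteration becomes measurable.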


Beyond this, you are what you do.

And you are what you do for other people.

Besides providing support and entertainment for our friends and families, the concrete things we do that bring value to society are through our jobs.

Society doesn't run on hanging out or hobbies.


Cursor has said they are using Composer through their inference provider (Fireworks). Presumably the MIT license is not viral like the GPL, so Cursor, and companies that use Cursor, do not need to display Kimi attribution on their products.

It's definitely not what Kimi wanted, but it sounds like this is what is written.


Unrelated to FSD, what's a good example where frontier AI struggles with logical thinking that even stupid humans can figure out?

I personally feel like that isn't really true any more.


The recent one was "should I drive my car to the car wash if it's only 300 feet from my house?", although it wasn't a slam dunk.


Right, but if these things are so rare that we all only know the one viral example, I feel like that lends credence to the models basically generally not having this problem.

Researchers built the Winograd Schema Challenge more than a decade ago to assess common sense reasoning, and LLMs beat that challenge task around GPT-4.


They're not so rare. Hallucinations have been spotted everywhere, but the "driving a car to the car wash" is an amusing one that's been recently publicised. Developers aren't going to point out every time an LLM hallucinates an entire library.


I'd add to this, any moderately involved logical or numerical problem causes hallucinations for me on all frontier models.

If you ask such a problem in isolation, they may write a script to solve it "properly", but I suspect that's only because enough of these examples were added to the training set. This workaround doesn't scale.

As soon as I give the LLM a proper problem and a small part of it requires numeric reasoning, it almost always hallucinates something and doesn't solve it with a script.

If the logic/math is part of a larger problem the miss rate is near 100%.
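The "write a script" escape hatch mentioned above is just deterministic computation. A hypothetical example of the kind of numeric subproblem that tends to get hallucinated when answered inline as text, but is trivial once emitted as code (compound interest is my own illustrative choice, not from the comment):

```python
# Hypothetical numeric subproblem: compound interest over n years.
# Done as a script, the arithmetic is exact; done token-by-token in
# an LLM's head, digits frequently come out wrong.

def compound(principal: float, rate: float, years: int) -> float:
    """Grow `principal` at `rate` per year for `years` years."""
    amount = principal
    for _ in range(years):
        amount *= 1 + rate
    return amount

print(round(compound(1000.0, 0.05, 10), 2))  # 1628.89
```

The commenter's point is that models reliably reach for this hatch only when the math is the whole task, not when it is buried inside a larger problem.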

LLMs have massive amounts of knowledge, encoded in verbal intelligence, but their logic intelligence is well below even average human intelligence.

If you look at how they work (tokenization and embeddings) it's clear that transformers will not solve the issue. The escape hatches only work very unreliably.


What's a typical example?

I have been broadly quite happy with gpt 5.4 xhigh's reasoning on things like performance engineering tasks.


If you ask this of any current day AI it will answer exactly how you would expect. Telling you to drive, and acknowledging the comedic nature of the question.


That's because AI labs keep stamping out the widely known failures. I assume without actually retraining the main model, but with some small classifier that detects the known meme questions and injects the correct answer into the context.

But try asking your favorite LLM what happens if you're holding a pen with two hands (one at each end) and let go of one end.
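The gating mechanism the comment speculates about could look roughly like this. To be clear, this is entirely hypothetical: the trigger list, the injected note, and the matching strategy are invented for illustration, and no lab has confirmed doing this:

```python
# Speculative sketch of the parent's guess: a cheap matcher that detects
# known "meme" questions and prepends a corrective note to the prompt
# before the main model sees it. All entries are invented examples.

KNOWN_TRAPS = {
    "car wash": "Note: driving to the car wash is the sensible answer; "
                "the car has to be there to get washed.",
}

def augment_prompt(user_prompt: str) -> str:
    """Prepend a canned correction if the prompt matches a known trap."""
    for trigger, note in KNOWN_TRAPS.items():
        if trigger in user_prompt.lower():
            return f"{note}\n\n{user_prompt}"
    return user_prompt

print(augment_prompt("Should I drive my car to the car wash?"))
```

A scheme like this would explain why the famous examples get fixed while structurally identical novel questions (like the pen one above) still fail.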



Are you also an LLM? Do objects often begin rotating when you're only holding them with one hand?


It's not unlikely that you're talking to a lot of AI-based AI boosters. It's easier to create astroturfed comments with chatbots than to fix the inherent problems.


I always like to ask AI to generate a middle-aged blond man with gray hair. It turns out that every model renders the gray hair with black roots.

https://chatgpt.com/share/69bcd01a-a750-800d-95f5-3b840b9ee2...

https://gemini.google.com/share/edc223bb6291 (the try again gave a woman, oops)

Even Midjourney couldn't do it.


Nice. My test was always a blond bald guy. It always adds hair. If you ask for bald, you get a dark-haired bald guy; if you add blond, you can't get bald, because I guess specifying a hair color implies hair on the head, while you may just want blond eyebrows and/or blond stubble.


It's not horrifically slow.


I think plenty of software is a pile of shit and still derive value from it.


Exactly, better the pile of shit you know than the pile of shit you don't know, or the pile of shit that is unknowable.


Yeah I'd go so far as to say that most useful software is "bad" in some way.


Worse is Better


This too will be solved. You can already get the frontier models from AWS/Google/Azure without needing to send your data to anyone else.


Companies need databases lol.

I don't know how you think a b2b company could run sales without a CRM like Salesforce.

To give your question a generous interpretation, Salesforce is more valuable than Apptio or your home grown CRM because it already has all the features any sales org needs, and all the fragmented sales and marketing tooling are already integrated with it.

And Sales is a very expensive and also high ROI activity. You don't want your sales team hung up trying to figure out how to get the random CRM to do something. You're not looking to cut costs in this area, you're looking to enhance the overall productivity of the org. Sales tooling overall is very expensive for this reason, any marginal edge is worth a lot.

It's also worth noting that a big value of things like Salesforce is that it lets management check up on what people are doing, because as much as HN doesn't like to admit it, people are often not very careful or diligent, and you need to perform supervision on the vast majority of people to improve their performance.

Jira is similar, in that eng is very expensive, and it's probably better than what these companies were doing beforehand, even if it is suboptimal.


It's true, literally no b2b sales companies existed before Salesforce. We must all continue to pay for Salesforce and support its workflows from now until the endless future, lest b2b sales vanish again.


Lol, nice straw man there.


This mostly just sounds like a poison pill that commercial entities wouldn't use, and if you want that you can already use AGPL.

Especially as the cost of producing code drops, the value of libraries decreases.


> Especially as the cost of producing code drops, the value of libraries decreases.

Does it? If the cost of slop that (1) no one understands, and (2) no one can be sued for if it misbehaves drops to zero, what have we gained? A "library" is code plus reliability and accountability. (Yes, GPL disclaims liability, but that's why consultants exist.)


Reliability is important for sure, but as you noted, there is no accountability for library maintainers.

I'm not saying all libraries will go to zero values, just that their value is decreasing.


