Hacker News | Frieren's comments

> Some people point at LLMs confabulating

No. LLMs do not confabulate; they bullshit. There is a big difference. AIs do not care, cannot care, have no capacity to care about the output. String tokens in, string tokens out. Even if they have all the data perfectly recorded, they will still fail to use it for a coherent output.

> Collapsing the dimensionality is going to be lossy, which means it will have gaps between what it thinks is the reality and what is.

Confabulation has to do with degradation of biological processes and information storage.

There is no equivalent in an LLM. Once the data is recorded it will be recalled exactly the same, down to the bit. An LLM representation is immutable. You can download a model 1000 times, run it for 10 years, etc. and the data is the same. The closest that you get is if you store the data on a faulty disk, but that is not why LLM output is so awful; that would be a trivial problem to solve with current technology (like having a RAID and a few checksums).
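
To make the checksum point concrete, here is a minimal sketch, assuming the model weights sit in a single local file. The names `file_sha256` and `is_unchanged` are hypothetical helpers for illustration, not part of any model tooling.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"" (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, expected_hex):
    """True if the file still hashes to the digest recorded earlier."""
    return file_sha256(path) == expected_hex
```

Recording the digest once and comparing it on every later load would catch storage corruption, and that is the only kind of "degradation" stored weights can suffer.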


I don't even think they bullshit, since that requires conscious effort that they do not and cannot possess. They just simply interpret things incorrectly sometimes, like any of us meatbags.

They make incorrect predictions of text to respond to prompts.

The neat thing about LLMs is they are very general models that can be used for lots of different things. The downside is they often make incorrect predictions, and what's worse, it isn't even very predictable to know when they make incorrect predictions.


I think this is leaning on the "lies are when you tell falsehoods on purpose; bullshit is when you simply don't care at all whether what you're saying is true" definition of bullshit. Cf. On Bullshit.

So, they can't lie, but they can (and, in fact, exclusively do) bullshit.


> No. LLMs do not confabulate; they bullshit. There is a big difference. AIs do not care, cannot care, have no capacity to care about the output. String tokens in, string tokens out. Even if they have all the data perfectly recorded, they will still fail to use it for a coherent output.

Isn't "caring" a necessary pre-requisite for bullshitting? One either bullshits because they care, or don't care, about the context.


They're presumably referring to the Harry Frankfurt definition of bullshit: "speech intended to persuade without regard for truth. The liar cares about the truth and attempts to hide it; the bullshitter doesn't care whether what they say is true or false."

The bullshitter does have an objective in mind however. There is some ultimate purpose to his bullshitting. LLMs don't even have that. They just spew words.

Thought of the same book when reading the above.

You seem confident. Can you get it to bullshit on GPT-5.4 thinking? Use a text prompt spanning 3-4 pages and let's see if it gets it wrong.

I haven't seen any counter examples, so you may give some examples to start with.


Here we go. Would this do?

https://chatgpt.com/share/69d6cc45-1678-8384-bd9c-0f313021ff...

The correct answer is that the U and _ in the mdstat output cannot be mapped to the rest of the output by either position or indexes in square brackets, so you can't tell the exact nature of the failure from the mdstat output alone (for the record, the failed disk was sda).

So all of the "analysis" was bullshit, including "it's probably multiple partitions from multiple drives". But there are so many juicy numbered and indexed bits of info to pattern match on!

Notice how, for the follow-up question, it "thought" for 4 minutes, going in circles trying to make an essentially random ordering make some sort of ordered sense, and then bullshitted its way to "it is sdb".


> The LLM companies are not picking on me in particular, they are pounding every site on the net.

Why is this not a criminal offense? They are hurting businesses for profit (or for a higher valuation, as they probably have no profit at all).

Why are corporations allowed to do with impunity what could land even a teenager years in prison? Is there no rule of law anymore?

The five-year and ten-year penalties kick in only when the government can show the offense caused at least $5,000 in losses across all victims during a one-year period. https://legalclarity.org/what-are-the-punishments-for-a-ddos...


> Why are corporations allowed to do with impunity what could land even a teenager years in prison? Is there no rule of law anymore?

Those laws are intended to protect corporations. If corporations are the ones doing the scraping, it doesn't make sense for the same laws to affect them.


Normative vs prerogative state [1]. See US v. Swartz compared to Meta's use of LibGen for Llama.

[1] https://en.wikipedia.org/wiki/Dual_state_(model)


So, I knew Aaron and I definitely would not presume to predict what he would have thought, but I’d point out there is a sizeable state space where he should never have been prosecuted, and scraping by others including large commercial companies should not be prosecutable on the same grounds.

I repeat what Aaron’s friends and lawyers said at the time: we were going to fight that case, and we were going to win.


His robots.txt explicitly allows bots including LLM bots to scrape his site

The LLM scraper bots ignore robots.txt

Permission to scrape is not permission to DDoS.

Because might makes right and any entity with the power to legally put up a fight is in on the game (or wants to be)

We've already established that computer crime and IP laws apply to normies and not tech companies

I have added a DB replica server just to keep my website from succumbing to AI bot traffic.

It's a bit more like a physical business with a "public welcome" policy like a coffee shop going viral and then having tens of thousands of people walking in and taking pictures but not buying coffee. It's disruptive, but not illegal.

Acme.com is welcome to require authentication for all pages but their home page, which would quickly cause the traffic to drop. They don't want to do this - like the coffee shop, they want to be open to public, and for good reasons.

Sometimes the use profile changes dramatically in a short time. 15 years ago, Netflix created the video streaming market and shared bandwidth capacity that had been excessive before wasn't enough. 15 years before that, Google did the same thing when they created search and started driving tremendous traffic to text based websites which had spread through word of mouth before.

Turns out the micro transaction people probably had the right idea.


Depends on the country. In Japan, you could be considered a "public nuisance" and be tossed behind bars for a bit.

adapt or die

waiting on the govt to do something is a path of failure


> waiting on the govt to do something is a path of failure

To keep the government accountable is a duty of every citizen and the only way to have a functioning society. The failure is to let the government be arbitrary and cater to the powerful instead of following the rule of law and applying it equally at all levels.


Because they have more money.

I've had to deploy a combination of Cloudflare's bot protection and anubis on over 200 domains across 8 different hosting environments in the last 2 months. I have small business clients that couldn't access their sales and support platforms because their websites that normally see tens of thousands of unique sessions per day are suddenly seeing over a million in an hour.

Anthropic and OpenAI were responsible for over 70% of that traffic.


> Is there no rule of law anymore?

Have you not been paying attention to the news for the past few years?

No, there isn't. If there were, Trump would be in prison, not the Oval Office. And he and the Republican Party have deliberately fostered this environment of corruption and rule-by-wealth so that they can gain more power and even more wealth.

And now they are also backing the AI zealots, and techbros more generally, to ensure that they can do whatever the hell they want, damn the consequences to the rest of the world.


Is what an offence lol? Bot scraper traffic?

How do you think search engines work?


Search engines appear to care more about being good "Netizens". It's not like GoogleBot never crashed a site, but it's rare. Search engine bots check if they need to back off for a bit, they check ETags, they notice if a page changes infrequently, and they slow down their crawl frequency.

If you train an LLM, it's not like you keep a copy of every page around, so there's no point to check if you need to re-scrape a page, you do, because you store nothing.

Personally I think people would be pretty indifferent to the new generation of scrapers, AI or other types, if they at least behaved and slowed down if they notice a site struggling. If they had the slightest bit of respect for others on the web, this wouldn't be an issue.
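
The "behave and slow down" part is not much code. A minimal sketch, assuming a crawler that remembers the validators it saw last time and keeps one delay per host; `conditional_headers`, `next_delay`, and their default limits are hypothetical, and only the delta-seconds form of `Retry-After` is handled (the HTTP-date form is ignored here).

```python
def conditional_headers(etag=None, last_modified=None):
    """Build revalidation headers so an unchanged page costs a 304, not a full body."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def next_delay(status, current_delay, retry_after=None,
               min_delay=1.0, max_delay=3600.0):
    """Choose how many seconds to wait before the next request to the same host."""
    if status in (429, 503):
        # The server says it is overloaded: honor Retry-After if it is a
        # plain number of seconds, otherwise back off exponentially.
        if retry_after is not None and retry_after.strip().isdigit():
            return min(max(float(retry_after), min_delay), max_delay)
        return min(current_delay * 2, max_delay)
    # Healthy response: relax gradually toward the minimum delay.
    return max(current_delay / 2, min_delay)
```

For example, `next_delay(503, 2.0)` doubles the wait to 4 seconds, and a `Retry-After: 120` header on a 429 pushes it to two minutes.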


They work because they offer ways to opt out, they honor crawl delay, setting ideal scraping times, IndexNow, etc.

And they give you real, valuable traffic in return.
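
Honoring opt-outs and crawl delay is cheap, too. A minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content, the `ExampleBot` user agent, and the `crawl_policy` helper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def crawl_policy(robots_txt, user_agent, url):
    """Return (allowed, crawl_delay) for a URL under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # crawl_delay() returns None when no Crawl-delay applies to this agent.
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)
```

A crawler would call `crawl_policy(ROBOTS_TXT, "ExampleBot", url)` before each fetch, skip the URL when the first value is false, and sleep for the second value between requests.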


Most offer ways to opt out, some don’t. Scraping somebody’s website might be annoying or problematic traffic-wise but that’s a far (very far) step removed from saying scrapers should be criminalised. The latter statement is outright laughable.

Because the law deals with intent. The intent of a 12-year-old skiddie with a DDoS box is to harm someone else's internet. The intent of big scrapers is to collect data. If you want to make the latter illegal, then vote for that instead of loading it with the normative baggage of the former.

It's the same problem as why Occupy Wall Street fell apart: a bunch of losers who don't understand the system screech about the system. Because they don't understand it, they can't offer any meaningful dialogue about how to fix it beyond screeching.


How are Swedish gangs using music platform Spotify to launder money?: https://www.euronews.com/business/2023/10/03/how-are-swedish...

You are not wrong.


> Then you need to watch comedies made decades ago.

Yes. It was nice when corporate taxes were high, xenophobia was seen as something bad, and movies could focus on satire of smaller problems.

I hope that we go back to the socialist era of the USA with unionization, safety nets and welfare for the working class instead of for billionaires. Movies could just be silly again.


Was that supposed to trigger me? Not from there, but I'm in favor of "unionization, safety nets and welfare for the working class instead of for billionaires" and higher corporate taxes!

But I also don't think the reason movies aren't silly is that they deal with all the "big problems". After all, they didn't have a problem making silly movies in eras with far worse problems, social and economic. And they could make hella fun movies on heavy topics just fine (Blazing Saddles and racism for example, or MASH and the Vietnam war, even if nominally about Korea).

Modern comedies aren't silly or fun, not because times are troubled, but because they're written as shallow moralizing lectures. Any "caring" is performative. They're also walking on eggshells, and are too polite to have any edge. And then there's the derivative reboots and remakes, which many of them are.


IBM has more revenue than Oracle even if we hear way less about it. 5 times smaller than Apple, though. It also has more employees than Microsoft or Alphabet. But it has tighter profit margins than other tech companies.

IBM is not in consumer products nor services so we do not hear about it.


Oracle/TSMC/SpaceX aren’t in consumer products/services, but they are heard about.

IBM was declining for 10 years while the rest of the tech related businesses were blowing up, plus IBM does not pay well, so other than it being a business in decline, there wasn’t much to talk about. No one expects anything new from IBM.

Also, they had quite a few big boondoggles where they were the bad guys helping swindle taxpayers due to the goodwill from their brand’s legacy, so being a dying rent seeking business as opposed to a growing innovative business was the assumption I had.


SpaceX is pretty heavily in consumer products/services now that Starlink is big. But otherwise yes you are correct.

They also helped the Nazis.

It’s a very different company post the PwC purchase. They have around 1/3 of the revenue from consulting which tends to push the valuation down due to its relative low margin when compared to software. This also inflates the number of employees.

There are several ways of looking at law and order.

One way is that the law applies to everybody equally. That has been the way it works for many years, not perfectly, in democratic countries.

There is another way of working where the law is not blind. Laws are applied based on who is affected. This is what big tech and the ultra-rich have been advocating for. The law applies differently to nobility and aristocrats than to the working class.

So, for all these big tech companies the law is clear: I can copy from you, you cannot copy from me.

(That is horrifying, in case anyone needs me to spell it out.)


A third way of looking at it is that you can't just blindly copy arguments when the situations are clearly different.

Nobody, not even Anthropic, is arguing that they should be able to host other people's paid content for free. The crux of their fair-use defense is that models are transformative works, just like parodies or book reviews, and hence should be treated as fair use.

You can't just take a pile of books (no pun intended) and turn that into Claude in a day with 30 lines of Python, there's a lot of work and know-how on the Anthropic side that goes into making a good LLM.


Anthropic argues that you should not use the Claude API to train your model.

Situation A - Anthropic pays for a book - Anthropic transforms the book into a new LLM (transformative use) -> OK

Situation B - I pay for the Anthropic API - I transform API responses into a new model (transformative use) -> Not OK

The situations are clearly the same.


Anthropic goes book->llm, you do llm->llm. Very different amounts of transformativeness.

this is the most honest argument for it. i respect that.

my impression is that if open models did 'distill' claude, they also contributed some interesting and productive ideas, like deepseek's more efficient attention


...idk...both transformations use transformers... thereby they both achieve adequate levels of "transformativeness" /s

If lossy-compressed transcodes of ripped movies are not "transformative works" and can even get people jailed, then lossy-compressed text of ripped books and websites isn't either.

There is a lot of know-how going into a good DivX rip too, you know.

And it enables so many novel uses, such as Popcorn Time, with flourishing business opportunities.

You wouldn't download a car. They did.


It’s 200 lines of python

do you really believe that? It's not just the training run, it's the whole infra around it as well

it's an exaggeration for sure but I don't think it's a stretch to believe Anthropic spends considerably more effort on data scraping & curation than anything else

In other words, the law is an instrument of power.

That’s a cynical view, but unfortunately it seems true in many cases, especially for corporate law.


"there is an in-group which the law protects but does not bind, and an out-group which the law binds but does not protect"

> Unfortunately it is their employees that are paying the price of leadership

Neoliberalism at its finest. The world moving towards conservatism has left us with this model: The working class takes the hit of each crisis from small to big.

It is not a sustainable model.


> OpenAI correctly realized overindexing on consumer where there isn’t money is not the right way.

It says a lot about the current economy that consumers have no money. Will companies just stop making consumer products?


Consumers have always paid with data not money. That is just how we are groomed. In fact that is more valuable to companies as it turns out. Sora though doesn’t work that way, it costs the company a lot with no useful data for them. It was always a vehicle to raise the company’s image and nothing else. The only way it’s useful for them is to show the user count to investors in their next funding round. Served no other purpose, but the market changed around them.


"always" is doing a lot of work here. Just 20 years ago I think consumers largely paid with money, not personal data.


> is more valuable to companies as it turns out

Yes. I have noticed that it is close to impossible to get good deals on flights, hotels, or even good discounts online. Sellers have all the information from consumers that they need to maximize their profit and extract the maximum amount from consumers. Dynamic pricing is making it a personalized experience, so I personally pay the maximum I possibly can.

No room to get a fair price anymore.


Consumers never pay for stuff on the internet. FB, Insta, TikTok, Google products, Reddit, Snapchat. This is not a new realization that OpenAI is having.


> Somehow It shifted from users know best to "Product" knows best.

In a world where consumers have less and less power, products are designed to please CEOs.

Money is power. As inequality grows and wealth concentrates, the average user/worker/citizen has less power and their voices matter less. Today's Internet is designed for the needs of big corporations; users are there just as another product to be sold.


> they've been cloning features of their API customers and adding them to their core products since day 1

Is this not just the strategy of all platforms? Spy on all customers, see what works for them and copy the most valuable business models. Amazon does that with all kinds of products.

Platforms will just grow to own all the market and hike prices and lower quality, and pay close to nothing to employees. This is why we used to have monopoly regulations before being greedy became a virtue.


It is exactly the strategy of all platforms - they get greedy to the point of screwing over their own customers. I've lost count of the number of times I've seen a platform get popular and then expand to offer the same services as its customers, often even undercutting market rates.

Just wait till they offer "Developer Certification" so you have to pay them to get a shiny little badge and a certificate while they go around saying no badge = you're shit.

