One problem is that people trust the bot. To give you an example: if you submit articles for scientific publication, you will get reviews that are at least co-authored by the bot. People defer their own judgement to what the bot says.
I told the bot I liked Steely Dan, Eagles, Bob Seger, and Roxette and asked it for music recommendations. It replied with Toto. Exasperated, I wrote "Oh, shit, you stupid bot, you don't know ANYTHING about music!"
PAP has exploited Singapore's strict libel laws to bankrupt opposition parties by suing for defamation. It is not so difficult to retain power when the opposition has no money for campaigning.
> I did, however, have a teacher who taught an advanced subject and I found his instruction so good that I did not have to bother with homework and assignments if I was happy with B grades
This comment made me roll my eyes. :) Giving students high grades for little effort is a cheat code for being considered a great teacher. Most everyone working in academia knows that.
Perhaps worth reading through the rest of the comment too? I had other teachers where it was easy to pass and get good grades (As even), but I did not call them out as good.
Before jumping to conclusions, maybe ask for context too? In particular, this was a high school for gifted math kids, and what I learned through regular classes let me pass the uni math entrance exam in the top 10 (out of ~500 students) with no extra prep, and even easily pass a couple of semesters of uni math with almost no prep (I took exams for two semesters after the first semester). My (lack of) working habits did catch up to me after that.
Also, for 4 years in two uni degrees, I did not get such a good teacher ever again, and there were a few who were easy to get great grades with.
Perhaps you can give some benefit of the doubt, though?
Well, I don't know you, your teacher, or even what subject we're discussing but you wrote that you did "not have to bother with homework and assignments". Perhaps you would have learned more if the instruction had not failed to make you study? :P
Whether they'd be able to make me study (it was a mathematical analysis course, though I do not think it matters) is a question that'd be hard to answer — I was already highly motivated with different topics: I was spending my time reading on CS topics like algorithm complexity, operating system development and building software interfacing with hardware. So another possible outcome would have been that I simply learned less of it due to lousy instruction — which was the case with a few other subjects.
Sometimes a teacher can also do a great job of getting you interested (also important) but I was focusing on quality instruction.
This is not necessarily the case; the coursework could have been produced by a different person from the teacher (although generally at my alma mater the 'module organiser' fulfils both roles).
I’m not sure GP’s point landed. As I read it, it had nothing to do with who created the curriculum or even how much the student learned.
“Anyone who applies the smallest amount of effort gets a B and anyone who really tries gets an A” is a path to being seen as a great teacher in the eyes of the students, especially the students who got a B.
I am not disputing that being the case in general, but it'd be nicer if they gave me more benefit of the doubt: I tried to give an honest view of actually receiving good instruction, and not enjoying being handed good grades for nothing.
I've told them directly what that got me (great uni entry exam scores with literally zero prep for a maths program, and a couple of semesters of passing exams with minimal prep as a maths/CS/physics major).
On top of that, I am talking about this almost 30 years later. Perhaps I have some perspective and am not a fresh-out-of-school guy who just loves getting off the hook easy?
The author cites 50-year-old education studies. It's exactly like citing 50-year-old papers about cancer research. They seriously need to update their views on what the state-of-the-art in pedagogy is.
It would be really cool to declare multiple modules in the same file, somehow. Also, the Janet community's generally against the word namespace, saying we don't have them. (I don't fully grok why not.)
The `import` macro extends the current environment with prefixed symbols from another environment. But the environment is a first-class object that you can hold and manipulate and use in arbitrary ways — `require` is the lower level primitive that `import` is built on.
I see what you’re saying - index 0 holds values from 0-1, index 1 from 1-2 etc, but then you have index 255 holding values between 255 and 256. So you’re sort of arguing that the 0-255 8-bit quantization is actually representing ‘real’ values of 0-256?…
Edit: somehow missed alterom’s reply - they explain it much better than my question above does.
Not quite. I'm saying there are 256 discrete numbers (0-255) and 255 intervals between those numbers. Most of the real values will fall into the intervals and get mapped to 0-255 somehow, maybe by nearest neighbor, but I'm not trying to define how they get mapped. The point is that 255 is the largest number that can be represented with 8 bits, so you should normalize by 255.
I wrote a longer reply to alterom but it looks buried for some reason.
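For what it's worth, here is a minimal Python sketch of the normalize-by-255 point above. The function names `normalize_u8` and `quantize_u8` are made up for illustration, and the round-to-nearest mapping back to 0-255 is just one possible choice, since the comment deliberately does not define how real values get mapped.

```python
def normalize_u8(v: int) -> float:
    """Map an 8-bit code (0..255) to a float in [0.0, 1.0] by dividing by 255."""
    if not 0 <= v <= 255:
        raise ValueError("expected an 8-bit value in 0..255")
    return v / 255


def quantize_u8(x: float) -> int:
    """Map a float in [0.0, 1.0] to the nearest 8-bit code.

    Round-to-nearest is an assumption here; other mappings are possible.
    """
    x = min(max(x, 0.0), 1.0)  # clamp out-of-range input
    return int(x * 255 + 0.5)


if __name__ == "__main__":
    # 256 discrete codes, 255 intervals between them: the endpoints land
    # exactly on 0.0 and 1.0.
    print(normalize_u8(0), normalize_u8(255))
    # Every code survives a round trip through the float representation.
    print(all(quantize_u8(normalize_u8(v)) == v for v in range(256)))
```

Dividing by 256 instead would make the largest code map to 255/256, so full white would never reach 1.0.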
People taking your work and not giving anything back was ALWAYS the risk you took when writing free software. LLM training doesn't change that much. That the US military is no doubt using gcc to compile embedded software for their ICBMs surely irks the GNU people. But you can't have it any other way. "You can only use my software for good things" just is not consistent with "free software".
Yeah, I really can't comprehend these sentiments as anything other than an "I don't like AI" argument. FOSS has always been about just writing code and putting it out into the world where others can do as they please with it.
I see a lot of risks involved in people surrendering their own decision-making to LLMs, but that's a question of how they're used, not how they're trained. The idea that using FOSS software to train LLMs is somehow a violation of FOSS norms just doesn't seem valid.
That's just the licensing part. The license says something, but a license doesn't turn people into slaves. The desire or decision to produce software has to come first and only then does code with a license exist.
Before AI and in the early days of FOSS, people assumed that the primary recipients of code sharing were other FOSS enthusiasts, in the form of developers and users.
Then there was a wave of permissive licensing, which obviously brought corporate interests with it. However, this was easily foreseeable, and many people who favored permissive licensing did so intentionally to appeal to corporate users, so the risk of them quitting due to perceived abuse was slim.
Now that LLMs are a thing, the primary recipient of a lone developer working on his project isn't really another human being. This human connection is now lost. Instead, your project is now laundered through the model and the model vendor can get away with ignoring your terms and conditions and let others write proprietary software.
In this transition period there were developers who thought that there was always going to be a human connection (even if part of a corporation), but then things changed and they realized their world view was wrong. Given the arrival of this new information, they obviously changed their behavior in accordance with how the world actually is.
> FOSS has always been about just writing code and putting it out into the world where others can do as they please with it.
That is wrong. How can you write that with a straight face? There are projects that are put into the public domain (one major one comes to mind), but the clear majority of FOSS projects have strings attached which make the intention of the authors absolutely clear.
IOW, if you're not happy with what the cost of the product is, then just don't use it.
I mean, the most restrictive license, the GPL, was conceived specifically to protect the "four freedoms" and prevent subsequent modifications from violating them. The "copyleft" concept was specifically designed to create an ecosystem that behaved as if copyright didn't apply in the first place.
I don't know how you can imply with a straight face that it did anything else.
I don't know how you can possibly argue that non-redistributive usage of software could ever violate the GPL -- and the other common FOSS licenses don't even have the copyleft provision, and literally are saying "do whatever you want, but I'm not responsible".
> that behaved as if copyright didn't apply in the first place.
If copyright didn't exist then the share-alike and anti-tivoization clauses wouldn't work, FOSS in general wouldn't even protect attribution. Copyleft ecosystems depend on some amount of copyright law to uphold themselves.
> The "copyleft" concept was specifically designed to create an ecosystem that behaved as if copyright didn't apply in the first place.
And if copyright didn't exist in the first place we wouldn't be having this conversation, because the models created by all the token providers will be open to all for whatever use that anyone wanted.
But it does exist, and within this framework, the creator gets to say how you may redistribute their IP, and "We compressed it very much" isn't an out.
> But it does exist, and within this framework, the creator gets to say how you may redistribute their IP,
Right. And the way the creator gets to exercise that say is by releasing their work under a license. If you release your work under a FOSS license, you're saying "you are free to copy this work and use it for your own purposes".
Complaining that people are using it for purposes you don't like after you've already given permission to them to use it for whatever purposes they please seems a bit disingenuous.
> and "We compressed it very much" isn't an out.
It's not, but I don't think we're discussing that. We're talking about LLMs, not people redistributing zip files containing someone else's work. If you're trying to imply that LLMs are merely a form of compression, that's a position you've got to argue for, because I'm definitely not seeing any similarity between the two.
Sure, but I guess I'm not seeing the relevance here. Are we seeing some greater-than-normal wave of people redistributing FOSS code without attribution, or creating derivative works without adhering to the license terms? LLM training doesn't seem to be either of these things.
Can you point to some specific examples of products shipped by the companies I assume you're referring to here that are in fact unattributed derivative works of GPL-licensed software?
Or are you saying that you think anything generated by an LLM qualifies as a derivative work of anything included in its training data?
> It's a tool, if using data is necessary to make the tool work, then its output derives from the data.
That's simply not correct within the applicable meaning of "derives" as understood in copyright law. In fact, data per se is not even within the scope of copyright protection in the first place: specific published works are copyrighted, but the underlying ideas and facts that they convey are not.
Even works that merely draw on a single source of data, but express the ideas drawn from it in a new or transformative way, are not considered derivative works (see the ruling in Google v. Oracle, for example), let alone works based on patterns extrapolated by relating together ideas sourced from many distinct works, which is what LLMs are principally doing.
If you applied the principle you're proposing here to human developers, you'd conclude that any code written by someone who learned to program by studying techniques used in FOSS software would in turn be a derivative work of that software. No one has ever regarded this to be the case.
> That's simply not correct within the applicable meaning of "derives" as understood in copyright law.
Would be rather hard to write a definition that handles it properly back when LLMs didn't exist; not that laws particularly have anything to do with intent/desires behind FOSS anyway - intent is clearly there: you get code, under the condition that if you use it for anything, I get credited; else, you get nothing.
> In fact, data per se is not even within the scope of copyright protection in the first place: specific published works are copyrighted, but the underlying ideas and facts that they convey are not.
Luckily, FOSS is specific published works, and unless LLMs actually reasonably-provably do such decomposing into ideas/facts (good luck reasoning about that), that part is also irrelevant.
> If you applied the principle you're proposing here to human developers, you'd conclude that any code written by someone who learned to program by studying techniques used in FOSS software would in turn be a derivative work of that software. No one has ever regarded this to be the case.
Depending on intent, that very much can happen; it's called plagiarism. Good luck proving an LLM's intent. (Not to mention the obvious differentiating factor of LLMs having arbitrarily good memory, unlike humans.)
> under the condition that if you use it for anything, I get credited; else, you get nothing.
But this has never been a condition in the FOSS world, as far as I'm aware. I've only ever seen attribution requirements attach to redistribution of source, not usage of the software.
I understand that the crux of the debate here is whether training an LLM is redistribution of the underlying code, but to me, it seems to be fairly clear that it is not.
> Luckily, FOSS is specific published works, and unless LLMs actually reasonably-provably do such decomposing into ideas/facts (good luck reasoning about that), that part is also irrelevant.
That's literally all LLMs do. That's what tokenization is. And it's trivially provable, since if you compare LLM models with the copyrighted works you're claiming they replicate, all you'll see on the LLM side is probability matrices representing correlations between decomposed units of knowledge aggregated across the entire dataset as an integrated whole.
> Depending on intent, that very much can happen, it's called plagiarism. Good luck proving an LLM's intent.
The only intent ever in play is that of the user. LLMs are just software.
> But this has never been a condition in the FOSS world, as far as I'm aware. I've only ever seen attribution requirements attach to redistribution of source, not usage of the software.
AGPL requires that even users who only interact with the software over a network be provided with a way to get the license (i.e. attribution) and source. Never mind that LLMs consume the source code instead of "using" the software anyway. (And of course things go more downhill for LLMs for licenses more restrictive than AGPL.)
Otherwise, I'd say that, for many, the ideal condition for (copyleft) FOSS would be that anything that utilizes source code in any form also provides said source code and license/attribution. Sometimes that can even extend to outputs of software (and e.g. gcc takes time to explicitly state that its compiled code output does not count as being derived from gcc's code).
> whether training an LLM is redistribution of the underlying code
There's a funky side-note of whether LLM training can even be done on material with improperly-followed licensing; if you don't even have the permission to modify the material (as properly following MIT/GPL/etc would give you), it might be illegal to even tokenize it, never mind use it for training.
> That's literally all LLMs do. That's what tokenization is.
It's clearly not that simple, otherwise "split source into 10-char chunks, reverse that list, reverse it back, join this fun list we've gotten" would be enough to circumvent copyright.
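To make that thought experiment concrete, here is a small Python sketch of the chunk-and-rejoin "transformation". The function name `chunk_roundtrip` is made up for illustration; the 10-character chunk size comes from the comment above.

```python
def chunk_roundtrip(source: str, size: int = 10) -> str:
    """Split text into fixed-size chunks, reverse the list twice, and rejoin."""
    chunks = [source[i:i + size] for i in range(0, len(source), size)]
    chunks.reverse()  # reverse the list of chunks...
    chunks.reverse()  # ...and reverse it straight back
    return "".join(chunks)


if __name__ == "__main__":
    original = "int main(void) { return 0; }  /* some licensed code */"
    # The output is byte-for-byte the input: the steps add up to a plain copy.
    print(chunk_roundtrip(original) == original)
```

The intermediate steps look like processing, but the result is identical to the input, which is why splitting text into pieces cannot, on its own, be what makes an output non-infringing.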
> all you'll see on the LLM side is probability matrices representing correlations between decomposed units of knowledge aggregated across the entire dataset as an integrated whole.
Yeah, you need at least that, tokenization is irrelevant. But jury's out on this one - of course a good chunk is some form of "abstract knowledge", but other parts could be just encoding material in some compressed form (and surely gzipping a source code file doesn't circumvent copyright) that at the very least can apply to weights.
> The only intent ever in play is that of the user. LLMs are just software.
So my split-into-words-and-join-back is valid circumvention of copyright, if the user of some software doing that isn't informed that it's just effectively directly copying material. (I'll grant that perhaps, in such a case, the accidental infringer might get a smaller penalty and/or get to defer punishment to whoever mismarketed the software to them... but that wouldn't apply to anyone who knows that LLMs are very much just directly trained on copyrighted material. I don't know about legally derived, but surely mathematically derived.)
Never mind that, for some things, learning some specific copyrighted code is the desired thing (humans do do this after all!), at which point at the very least the weights of the model are as copyright-infused as a gzipped source code file is.
If intent determination is on the user, and the user is aware that LLMs are very much technically capable of producing copyrighted works to some extent (which they better be), it would be on the user to ensure that any specific code they end up using is not, which is...a rather non-trivial task (a human that writes code can also reasonably-reason about whether they're infringing on whatever they learned from, but splitting into LLM writing + human checking fundamentally makes that basically infeasible).
There's an almost intergalactic level of irony in the extent to which open source has benefited giant corporations and the military at the expense of individuals, and ultimately contributed to the commercialised enclosure of software IP.
I suppose you could argue it also indirectly led to the empowerment of non-developers to create their own vibe coded solutions. But we're not quite there yet.
And the AI IP that makes that possible is still enclosed rather than open.
Sure, Free Software hasn't been the vehicle for societal change that RMS and others certainly hoped. I remember being flamed out in a user group for suggesting that our conference shouldn't be held in a "non-free" country such as Morocco, Turkey, or China because it's counter-productive to freedom. Very few people actually got it. But it's orthogonal to LLM trainers also using free software in "non-approved" ways.
> There's an almost intergalactic level of irony in the extent to which open source has benefited giant corporations and the military at the expense of individuals, and ultimately contributed to the commercialised enclosure of software IP.
Could you perhaps explain that irony a bit more explicitly?
Can you provide any examples of "commercialized enclosure of software IP" somehow backwashing into the FOSS ecosystem and closing things up that are already open?
Before LLMs, you could use the GNU GPL or other copyleft licenses to protect your code from being used to develop non-free software. Unfortunately, the courts have decided that LLMs are free to ignore licenses.