I agree here. Formal models have to become easier to create, though.
Today's ecosystem requires advanced knowledge of system design as well as serious coding ability.
To democratize model generation we need a more iterative and understandable way of defining intended execution. The problem is this devolves into just coding the damn thing pretty quickly.
For sure! I agree, it needs better languages, education, and tooling. It's not about making a hard problem harder; it's about making it more accessible and straightforward to teach and use in day-to-day work.
Being more clear and precise in our specifications would only benefit us and the AI/ML tool generating the code. We could lean more on the correctness built into the entire stack rather than having to proofread a mess of inferred code, something we're terribly ill-equipped to do.
My view is that Copilot is not stealing open source code. It is learning from it just as a human reader would. People's disgust stems from seeing what they assumed was a human trait being machine-derived from their work.
A Copilot service backed by an army of actual humans wouldn't be a story at all. Nor would anyone be angry if an individual offered coding skills as a service, having gone through the exercise of learning a great deal from open source software to do so.
No open source license was written with this in mind. Because previously learning was something only humans could do and no one had issue with sharing that knowledge. Until licenses take machine learning use into account I see no problems with Copilot.
Source cannot be open if you restrict any viewing of it.
You aren't allowed to just read code and regurgitate it in order to claim it as your own. That is, just because you memorized this great new novel you read, it doesn't mean you can go and sit down and hammer it out and sell new copies. People go to great lengths to do this sort of thing (see: clean room reverse engineering [1]) in order to try and wash themselves of liability.
If the code was purely utilitarian in nature, such as something that was optimized for execution time, there is plenty of precedent stating that the code in question is not covered by copyright.
Do an internet search for “copyright utilitarian” and read up on it if you don’t believe me!
Copyright is about protecting artistic expression, which is held in contrast to the useful nature of a work.
Note: In the US, this concept is explicitly in the Copyright Act:
"In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work." (17 USC 102(b) [0]).
If you think most people pay any attention to licenses or respect them you better think again. Snippets get copied verbatim with no regard to their source all the time. Licenses have no power and are routinely ignored.
Mmm maybe rephrase that as “depending upon which entity’s copyright was violated”
Surely I don’t need to recite the last 50 years of tech legal precedent and case history for you to see that such a blanket generalization cannot be left unaddressed.
> It is learning from it just as a human reader would
I don't see how that invalidates the copyright/license argument. So, instead of just a straight up license violation it's a license violation via plagiarism.
That argument wouldn't hold up even if it was a human that caused the violation. You can't just paraphrase someone's licensed work, lie about having looked at it, and pretend you made it yourself, which is basically what seems to happen with Copilot, as it doesn't also automatically reproduce the license of the code it reproduces.
> You can't just paraphrase someones licensed work
Yes you can. That's exactly why you paraphrased it instead of copying verbatim.
At the fringes, your transformation may not be enough to overcome the requirements, but that's an exception. Nearly all paraphrasing is legal by default.
It learns the same way a human does by learning patterns. It is not illegal to comprehend how to accomplish tasks by reading other people's source code.
The arguments against my point always assume perfect memory of everything this model has consumed. This is the plagiarism position. In reality, some patterns are more common than others and generate code that looks nearly identical. I can't speak to the reasons for this, as I'm not familiar with all of the methods. However, I don't assume that is the current working state or intent of Codex.
> It learns the same way a human does by learning patterns. It is not illegal to comprehend how to accomplish tasks by reading other people's source code.
It remains to be seen whether ML is true "learning" in the sense of developing a skill the way a human does over time.
It is however irrelevant to the manner in which this model operates today.
It isn't really learning if it's just regurgitating whole function bodies. I use Copilot a lot, and definitely see whole functions being spit out that were presumably written by a person somewhere.
I also use Copilot a lot, and while it does suggest large function bodies, I'm not sure that it's "regurgitating" them (though it could be...I don't know). I suspect that it's seen so many function bodies that are similar that it generates another similar output. Like autocomplete in a word processor has seen so many similar chunks of text that it reproduces them based on past experience. I don't know this as a fact, of course. I'm just reacting to the word "regurgitating."
Just did a quick github.com search on that function (with comments) and found around 131 matches. Many without a license. So yes, I believe that it would produce those comments...because it's seen humans repurpose and reuse that code without attribution or license many times.
Definitely an issue...but not as simple as copy and paste.
I definitely agree that we should find out. I've used Copilot almost since its inception, and I've seen nothing like a large copied/pasted function. If anything, it's mostly a single/double line autocomplete based on what I would have written anyway.
The Luddite reaction to Copilot is hilarious to me. It seems to be a great way to identify low-talent coders, because who else would possibly feel so threatened by an AI? Watching HN commenters suddenly become ardent defenders of copyright is quite the sight.
I see your statement as an inversion of consensus reality. What actual coder would use copilot?
A beginner or dabbler.
I predict your attempt at tactically “managing” this copilot scandal will not play well on HN to experienced coders, your Microsoft colleagues chiming in next claiming it boosts their productivity notwithstanding.
I tried it out, but I don’t use it at all on a day-to-day basis. I have no idea if it boosts productivity or not. I also have no idea how you came up with the claim that it’s only for beginners, I’m presuming you just made this up? All of the people that I’ve noticed talking publicly about their experiences using it are highly experienced engineers.
I just think it’s hilarious how fair use is so widely supported on HN when it comes to music, or videos, or interface names, but all of a sudden is a moral crisis when it appears to threaten the value of HN members’ labor.
This is the only way to scale your impact as an IC.
The goal isn't to always have to do the repetitive reviews. It's to mentor people up so they do them for you eventually.
When you lift those juniors up enough for them to mentor the next iteration, you start the flywheel. You get to mentor and review the work of people who operate at a higher level, and they get to learn to mentor juniors who were in their position. Everyone, even you, gets to move up.
What were the limitations that required you to move all customers to a shared system at once?
Could you have selected some workspaces with lower traffic to migrate first? That would have decreased the load on the primary, potentially speeding up replication, creating a flywheel that lets more customers migrate to shards.
Good question, that was an option. The main motivating factor here was that vacuums were beginning to take dangerously long. O(weeks) to complete, independent of the load on the database. While migrating spaces in segments would have reduced the number of records future vacuums need to scan, we were already running against the clock to complete one vacuum prior to TXID wraparound[0]. To kick off replication for specific spaces we would have needed to write our shard key to all data owned by those spaces. That would further contribute to TXID growth, and was not something we were comfortable doing.
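For context on the wraparound clock mentioned above: Postgres transaction IDs are 32-bit counters, and the server forces a shutdown before the oldest unfrozen XID falls roughly 2^31 transactions behind the current one. A back-of-the-envelope headroom calculation, with all numbers hypothetical and not taken from the migration described:

```python
# Postgres transaction IDs are 32-bit; the usable horizon before
# wraparound protection kicks in is roughly 2**31 transactions.
WRAPAROUND_LIMIT = 2**31  # ~2.1 billion XIDs

def xids_until_wraparound(datfrozenxid_age: int) -> int:
    """Headroom remaining given the current age(datfrozenxid)."""
    return WRAPAROUND_LIMIT - datfrozenxid_age

# Hypothetical numbers: 1.6B XIDs already aged, burning 5M XIDs/day.
age = 1_600_000_000
burn_per_day = 5_000_000
days_left = xids_until_wraparound(age) / burn_per_day
print(f"~{days_left:.0f} days of headroom")  # prints "~109 days of headroom"
```

With weeks-long vacuums eating into a window like that, it's easy to see why writing a shard key to every row (and burning more XIDs in the process) felt risky.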
At the end of the day, this is something we could have explored in more depth, but we were ultimately comfortable with the risk tradeoff of migrating all users at once vs. the consequences of depending on the monolith for longer, largely thanks to the effort we put into validating our migration strategy.
Seems like something that might still be worth exploring, as if I’m thinking about this correctly, it would allow you to create new shards on the fly, and to migrate workspaces between shards while only locking one workspace at a time, and only for the amount of time required to catch up that single workspace.
Git is designed to require human oversight. This is usually a feature, but in recent years has become a bug with things like GitOps.
It's important to remember that Git is a terrible database because of its lack of semantic structure. All conflicts require a human who has the context. This is why almost no one builds a system that uses Git as a two-way interface. And when they do, it's via GitHub Pull Requests (which go to humans) and not Git itself.
In all, this makes it a wonderful general purpose shared filesystem. And that's about it.
Estimates become less useful the further out they are. Getting useful info about how long a project will take requires two large changes: careful definition of the problem and effective scoping of deliverables to validate your solution.
I push teams to understand the problem before attempting any implementations. Without this context, people usually build something awesome that isn't useful. How to do that is a thread of its own.
The big hack is figuring out what can be delivered to test your assumption quickly. I shoot for about a month of work for this. Give or take.
Up front you get the team to agree, "this milestone should be easy to deliver, assuming we understand the problem and our assumptions are right". Then, if you miss that deliverable you stop work on the project and figure out why you were wrong.
This stop is meant to combat the Sunk Cost Fallacy. Then you can try a new approach, cancel the project, or keep going having only "wasted" a month. These are sometimes called Kill Metrics.
Long-term estimates commonly fall to sunk cost issues in my experience. This is where a rush hits at the end and you get a low-quality product.
It takes a shift in how engineering communicates with other orgs to pull this off. You need to account for their needs in the milestones and keep them in the loop as a final delivery date comes into focus. It works to go from second half of the year -> Q4 -> Nov -> date. As long as you refine those with enough lead time.
Codespaces seems like magic. I've been using it for personal projects, and it is fantastic to do simple edits in a browser on the go.
That's not the intended use case, though. The actual value is where you get to scale up and automate what used to be only on local workstations.
Nightly builds with up-to-date dependencies and pre-pulled code get you going faster. Having a shared image encourages people to share all those local scripts that make things work, where they might previously have just left them in a local `bin/`.
Onboarding scripts diverge and fragment workstations because longer-tenured employees ran old versions and never picked up the new changes. This lets everyone use the same up-to-date tools together.
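That shared image is usually expressed as a dev container definition checked into the repo, so everyone rebuilds from the same source of truth instead of drifting onboarding scripts. A minimal sketch of a `.devcontainer/devcontainer.json` (image name, script path, and extension are hypothetical placeholders):

```json
{
  "name": "team-shared-env",
  "image": "ghcr.io/example-org/dev-image:nightly",
  "postCreateCommand": "bin/setup.sh",
  "customizations": {
    "vscode": {
      "extensions": ["ms-python.python"]
    }
  }
}
```

Because the image tag is rebuilt nightly, a `Rebuild Container` picks up new tooling for everyone at once rather than relying on each person rerunning an onboarding script.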
I'm excited to see where Github takes this. Tons of possibilities in now using Codespaces to create "local" environments composed of multiple machines.