I think it is important to find more rigorous things to test than the general sentiment of the people using the tools, if only because the more benchmarks we have, the more we can improve models without regressions. METR is asking a really interesting question here: "are models improving at making one-shot PRs?". The answer seems to be yes, but more slowly than benchmarks suggest, if you look at the pass rates of different versions of Claude Sonnet. A reasonable objection is "you're not supposed to use them by making one-shot PRs", but then we would ideally need some kind of standardized test for models' ability to incorporate feedback and evolve PRs.
>To study how agent success on benchmark tasks relates to real-world usefulness, we had 4 active maintainers from 3 SWE-bench Verified repositories review 296 AI-generated pull requests (PRs). We had maintainers (hypothetically) accept or request changes for patches as well as provide the core reason they were requesting changes: core functionality failure, patch breaks other code or code quality issues.
I would also advise taking a look at the rejection reasons for the PRs. For example, Figure 5 shows two rejections for "code quality" because of (and I quote) "looks like a useless AI slop comment." This is something models still do, but it is also very easily fixable. I think in that case the issue is that the desired level of commenting hasn't been properly formalized in the repo, and the model wasn't able to deduce it from the context it had.
As for the article, I think mixing all models together doesn't make sense. For example, maybe a slope describes the improvement across Claude Sonnet versions better than a step function does.
That is true but also a bit unfair; they've also been oddly preoccupied with topics like trying to help the most people, and they frequently promote giving money to effective charities that fight malaria and vitamin A deficiency and vaccinate children in very poor countries.
This image comes from running the different versions of the Benchmarks Game programs. Some of the differences between languages may actually just be algorithmic differences, and those programs are in general not representative of most of the software that actually runs.
I have no tolerance for bystanders being killed in general. If the science experiments kill fewer bystanders on average, I'm all for them; if they don't, they should be stopped until made safer.
In this case the judgement is so extreme because the judge had no tolerance for Tesla lying about the server logs' existence and contents (namely, that it was indeed their Autopilot that was in full control, had been in full control for almost half an hour, and was not issuing any warnings at the time of the crash).
HathiTrust (https://en.wikipedia.org/wiki/HathiTrust) has 6.7 million volumes in the public domain, in PDF from what I understand. That would be around a billion pages, if we assume a volume is ~200 pages: roughly 5000 days to go through it with a single A100-40G at 200k pages a day. That is one way to interpret what they say as being legal. I don't have any information on what happens at DeepSeek, so I can't say whether it's true.
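A quick sketch of that back-of-envelope math; every figure here is a rough assumption from the comment (average pages per volume and per-GPU throughput are guesses, not measured values):

```python
# All numbers are rough assumptions, not measured values.
volumes = 6_700_000            # HathiTrust public-domain volumes
pages_per_volume = 200         # assumed average length
total_pages = volumes * pages_per_volume   # 1,340,000,000 -> "around a billion"

pages_rounded = 1_000_000_000  # the rounded figure used above
throughput_per_day = 200_000   # assumed pages/day on one A100-40G
days = pages_rounded / throughput_per_day

print(f"{total_pages:,} pages -> ~{days:,.0f} GPU-days")
```

Using the unrounded ~1.34 billion pages instead would give closer to 6,700 GPU-days, so the 5000-day figure is on the optimistic side of the same estimate.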
Has the difference between performance in "regular benchmarks" and ARC-AGI been a good predictor of how good models "really are"? Like if a model is great in regular benchmarks and terrible in ARC-AGI, does that tell us anything about the model other than "it's maybe benchmaxxed" or "it's not ARC-AGI benchmaxxed"?
>Last semester, professor Pamela Newton, who also teaches the course, allowed students to bring readings either on tablets or in printed form. While laptops felt like a “wall” in class, Newton said, students could use iPads to annotate readings and lay them flat on the table during discussions. However, Newton said she felt “paranoid” that students could be texting during class.
>This semester, Newton has removed the option to bring iPads to class, except for accessibility needs, as a part of the general movement in the “Reading and Writing the Modern Essay” seminars to “swim against the tide of AI use,” reduce “the infiltration of tech,” and “go back to pen and paper,” she said.
Is this about teaching efficiency or managing the teacher's feelings? If "the infiltration of tech" allowed for better learning, would this teacher even be open to it?