I think it is important to find more rigorous things to test than the general sentiment of the people using the tools, if only because the more benchmarks we have, the more we can improve models without regressions. METR is asking a really interesting question here: "are models improving at making one-shot PRs?". The answer seems to be yes, but more slowly than benchmarks suggest, if you look at the pass rates of different versions of Claude Sonnet. A reasonable objection is "you're not supposed to use them by making one-shot PRs", but then we would ideally need some kind of standardized test for models' ability to incorporate feedback and evolve PRs.
>To study how agent success on benchmark tasks relates to real-world usefulness, we had 4 active maintainers from 3 SWE-bench Verified repositories review 296 AI-generated pull requests (PRs). We had maintainers (hypothetically) accept or request changes for patches as well as provide the core reason they were requesting changes: core functionality failure, patch breaks other code or code quality issues.
I would also advise taking a look at the rejection reasons for the PRs. For example, Figure 5 shows two rejections for "code quality" because of (and I quote) "looks like a useless AI slop comment." This is something models still do, but it is also very easily fixable. I think in that case the issue is that the desired level of commenting hasn't been properly formalized in the repo, and the model wasn't able to deduce it from the context it had.
As for the article, I think mixing all models together doesn't make sense. For example, maybe a slope describes the improvement across Claude Sonnet versions better than a step function does.
That is true but also a bit unfair; they've also been oddly preoccupied with topics like trying to help the most people, and they frequently promote giving money to effective charities that fight malaria and vitamin A deficiency and vaccinate children in very poor countries.
This image comes from running the different versions of the Benchmarks Game programs. Some of the differences between languages may actually just be algorithmic differences, and those programs are in general not representative of most of the software that actually runs.
I have no tolerance for bystanders being killed in general. If the science experiments kill fewer bystanders on average, I'm all for them; if they don't, they should be stopped until made safer.
In this case the judgement is so extreme because the judge had no tolerance for Tesla lying about the server logs' existence and contents (namely, that it was indeed their Autopilot that was in full control, had been in full control for almost half an hour, and was not issuing any warnings at the time of the crash).
HathiTrust (https://en.wikipedia.org/wiki/HathiTrust) has 6.7 million volumes in the public domain, in PDF from what I understand. That would be around a billion pages, if we assume a volume is ~200 pages: roughly 5000 days to go through it with a single A100-40G at 200k pages a day. That is one way to interpret what they say as being legal. I don't have any information on what happens at DeepSeek, so I can't say whether it's true.
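A quick sketch of that back-of-envelope math; every figure here is a rough assumption from the comment (average pages per volume and per-GPU throughput are guesses, not measured values):

```python
# All numbers are rough assumptions, not measured values.
volumes = 6_700_000            # HathiTrust public-domain volumes
pages_per_volume = 200         # assumed average length
total_pages = volumes * pages_per_volume   # 1,340,000,000 -> "around a billion"

pages_rounded = 1_000_000_000  # the rounded figure used above
throughput_per_day = 200_000   # assumed pages/day on one A100-40G
days = pages_rounded / throughput_per_day

print(f"{total_pages:,} pages -> ~{days:,.0f} GPU-days")
```

Using the unrounded ~1.34 billion pages instead would give closer to 6,700 GPU-days, so the 5000-day figure is on the optimistic side of the same estimate.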
Has the difference between performance in "regular benchmarks" and ARC-AGI been a good predictor of how good models "really are"? Like if a model is great in regular benchmarks and terrible in ARC-AGI, does that tell us anything about the model other than "it's maybe benchmaxxed" or "it's not ARC-AGI benchmaxxed"?
>Last semester, professor Pamela Newton, who also teaches the course, allowed students to bring readings either on tablets or in printed form. While laptops felt like a “wall” in class, Newton said, students could use iPads to annotate readings and lay them flat on the table during discussions. However, Newton said she felt “paranoid” that students could be texting during class.
>This semester, Newton has removed the option to bring iPads to class, except for accessibility needs, as a part of the general movement in the “Reading and Writing the Modern Essay” seminars to “swim against the tide of AI use,” reduce “the infiltration of tech,” and “go back to pen and paper,” she said.
Is this about teaching efficiency or managing the teacher's feelings? If "the infiltration of tech" allowed for better learning, would this teacher even be open to it?