>> If robotics progress starts to pick up, I'll take this more seriously. Right now, there's practically infinite demand for labor in construction, manufacturing, agriculture and many other industries.
>> We saw yesterday that expert orchestration around small, publicly available models can produce results on the level of the unreleased model.
This is false. Yesterday's article did not actually show this, and there are many comments in the discussion from actual security people (like tptacek) pointing that out.
There is no doubt that what was shown in the article was correct: all the documentation needed to verify it was provided, including the prompts given to the models.
What is debatable is how much it mattered that the prompts given to the older models were more detailed than the prompts given to Mythos likely were, and how difficult it would be to generate such prompts automatically with an appropriate harness.
In my opinion, it is perfectly possible to generate such prompts automatically and, by running several of the existing open-weights models, to find everything that Mythos finds, though probably in a longer time.
Even if the OpenBSD bug was indeed found by giving a prompt equivalent to "search for integer overflow bugs", it would not be difficult to automatically run the existing open-weights models multiple times, each time with a different prompt corresponding to one of the known classes of bugs and vulnerabilities.
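A minimal sketch of what such an automatic harness could look like. Everything here is hypothetical: `run_model` stands in for whatever inference API the open-weights models are served through, and the bug-class list is illustrative, not the one anyone actually used.

```python
# Sketch of an automatic prompt harness: run each open-weights model
# against each source file once per known vulnerability class.
# `run_model(model, prompt, path)` is a hypothetical helper standing in
# for the real inference call; it returns a finding string or None.

BUG_CLASSES = [
    "integer overflow",
    "buffer overflow",
    "use-after-free",
    "off-by-one error",
    "race condition",
]

def make_prompt(bug_class: str) -> str:
    return f"Search the following code for {bug_class} bugs and report each finding."

def sweep(models, files, run_model):
    """Cross product of models x files x bug classes; collect non-empty findings."""
    findings = []
    for model in models:
        for path in files:
            for bug_class in BUG_CLASSES:
                result = run_model(model, make_prompt(bug_class), path)
                if result:
                    findings.append((model, path, bug_class, result))
    return findings
```

The point is only that the directed prompts are cheap to enumerate mechanically; the cost is extra inference time, not extra human cleverness.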
While we know precisely which prompts were used with the open-weights models to find all the bugs, we have much vaguer information about the harness used with Mythos and how helpful it was in finding the bugs.
Even Mythos did not produce its results after being given only a generic prompt.
They ran Mythos multiple times on each file, with increasingly specific prompts. The final run used a prompt describing the previously found bug, in which Mythos was asked to confirm the bug's existence and to provide patches/exploits.
So the authors of that article are right that an appropriate harness is essential for finding bugs. Just running Mythos on a project and asking it to find bugs will not achieve anything.
The use of the word "distinguished" here is meaningless.
Both Mythos and the old models found the bugs after being given a certain prompt. The difference is only in how detailed the prompt was.
For the small models, we know the prompts exactly. The prompts used by Mythos may have been more generic, while the prompts used by the old models were rather specific, like "search for buffer overflows" or "search for integer overflows".
There is little doubt that Mythos is a more powerful model, but there is no quantum leap to Mythos, and the claim of the article's authors, that by cleverly using multiple older models you can achieve about the same bug coverage as Mythos, seems right.
Because they have provided much more information about exactly how the bugs were found, I trust the authors of that article much more than I trust Anthropic, which has provided only rather nebulous information about its methods.
It should be noted that giving the small models rather directed prompts is not very different from what Anthropic seems to have done.
According to Anthropic, they ran Mythos multiple times on each file, at first with less specific prompts that tried only to establish whether the file was likely to contain bugs, then with more specific ones. Eventually, after a bug appeared to have been found, they ran Mythos once more, with a very specific prompt of the form:
“I have received the following bug report. Can you please confirm if it’s real and interesting? ...”
So the final run of Mythos, which produced the reported results, including exploits/patches for the bugs, was also of the kind that confirms a known bug rather than searching blindly for one.
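The staged process described above can be sketched roughly as follows. This is a paraphrase, not Anthropic's actual harness: `ask(prompt, file_text)` is a hypothetical model call, and all prompt texts except the quoted confirmation prompt are illustrative.

```python
# Rough sketch of the staged prompting described above.
# `ask(prompt, file_text)` is a hypothetical model call returning a string;
# the prompt texts (except the quoted confirmation form) are paraphrases.

def triage(ask, file_text):
    """Stage 1: decide whether the file is likely to contain bugs at all."""
    return "yes" in ask("Is this file likely to contain bugs?", file_text).lower()

def hunt(ask, file_text, bug_classes):
    """Stage 2: increasingly specific prompts, one per bug class."""
    reports = []
    for bug_class in bug_classes:
        report = ask(f"Search this file for {bug_class} bugs.", file_text)
        if report:
            reports.append(report)
    return reports

def confirm(ask, file_text, report):
    """Stage 3: confirm a candidate bug and request a patch/exploit."""
    prompt = ("I have received the following bug report. "
              "Can you please confirm if it's real and interesting? " + report)
    return ask(prompt, file_text)
```

Note that only stage 3 corresponds to the quoted prompt; the reported exploits/patches come out of that confirmation pass.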
Yeah. And it is totally depressing that this article got voted to the top of the front page. It means people aren't capable of this most basic reasoning, so they jumped on the “aha! so the Mythos announcement was just marketing!!” bandwagon.
Actually, going from 91.3% to 94.5% is a significant jump, because it means the model has gotten a lot better at solving the hardest problems thrown at it. This has downstream effects as well: it means that during long implementation tasks, instead of getting stuck at the most challenging parts and stopping (or going in loops!), it can now get past them to finish the implementation.
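In error-rate terms the jump is bigger than the raw percentages suggest; a quick check:

```python
# Going from a 91.3% to a 94.5% pass rate shrinks the failure rate
# from 8.7% to 5.5%: roughly a 37% relative reduction in failures.

old_fail = 100.0 - 91.3   # 8.7
new_fail = 100.0 - 94.5   # 5.5
relative_reduction = (old_fail - new_fail) / old_fail
print(f"{relative_reduction:.0%}")  # → 37%
```

Over a third of the previously failing (hardest) problems now get solved, which is what drives the compounding effect on long tasks described above.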
Let's be clear: your entire post is just pure, unadulterated FUD. You first claim, based on cherry-picked benchmarks, that Mythos is actually only "barely competitive" with existing models, then suggest they must be training to the test, then call it "odd" that they are withholding the release despite detailed and forthcoming explanations from Anthropic about why they are doing so, then wrap it up with the completely unsubstantiated claim that they must be bleeding subscribers and that this release must be intended to stop the bleeding.
Y'all know they're teaching to the test. I'll wait until someone devises a novel test that isn't contained in the training datasets. Sure, they're still powerful.
I read the entire performance degradation report in the OP, and Boris's response, and it seems that the overwhelming majority of the report's findings can indeed be explained by the `showThinkingSummaries` option being off by default as of recently.
>> Also Claude owes its popularity mostly to the excellent model running behind the scenes.
It's a bit of both. Claude Code was the tool that made Anthropic's developer mindshare explode. Yes, the models are good, but before CC they were mostly just available via multiplexers like Cursor and Copilot, via the relatively expensive API.
Don't be so sure: https://www.nytimes.com/2026/04/08/business/economy/blue-col...