>> If robotics progress starts to pick up, I'll take this more seriously. Right now, there's practically infinite demand for labor in construction, manufacturing, agriculture and many other industries.
>> We saw yesterday that expert orchestration around small, publicly available models can produce results on the level of the unreleased model.
This is false. Yesterday's article did not actually show this, and there are many comments in the discussion from actual security people (like tptacek) pointing that out.
There is no doubt that what was shown in the article was correct: all the documentation needed to verify it was provided, including the prompts given to the models.
What is debatable is how much it mattered that the prompts given to the older models were more detailed than the prompts given to Mythos likely were, and how difficult it would be to generate such prompts automatically with an appropriate harness.
In my opinion, it is perfectly possible to generate such prompts automatically and, by running several of the existing open-weights models, to find everything that Mythos finds, though probably in a longer time.
Even if the OpenBSD bug was indeed found by giving a prompt equivalent to "search for integer overflow bugs", it would not be difficult to automatically run the existing open-weights models multiple times, each time with a different prompt corresponding to one of the known classes of bugs and vulnerabilities.
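A minimal sketch of what such an automatic harness could look like. Everything here is hypothetical: `run_model` stands in for whatever inference API the open-weights models are served through, and the bug-class list is illustrative, not the one anyone actually used.

```python
# Sketch of an automatic prompt harness: run each open-weights model
# against each source file once per known vulnerability class.
# `run_model(model, prompt, path)` is a hypothetical helper standing in
# for the real inference call; it returns a finding string or None.

BUG_CLASSES = [
    "integer overflow",
    "buffer overflow",
    "use-after-free",
    "off-by-one error",
    "race condition",
]

def make_prompt(bug_class: str) -> str:
    return f"Search the following code for {bug_class} bugs and report each finding."

def sweep(models, files, run_model):
    """Cross product of models x files x bug classes; collect non-empty findings."""
    findings = []
    for model in models:
        for path in files:
            for bug_class in BUG_CLASSES:
                result = run_model(model, make_prompt(bug_class), path)
                if result:
                    findings.append((model, path, bug_class, result))
    return findings
```

The point is only that the directed prompts are cheap to enumerate mechanically; the cost is extra inference time, not extra human cleverness.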
While we know precisely which prompts were used with the open-weights models to find all the bugs, we have much vaguer information about the harness used with Mythos and how helpful it was in finding the bugs.
Even Mythos did not produce its results after being given only a generic prompt.
They ran Mythos multiple times on each file, with increasingly specific prompts. The final run used a prompt describing the previously found bug, in which Mythos was asked to confirm the bug's existence and to provide patches/exploits.
So the authors of that article are right that an appropriate harness is essential for finding bugs. Just running Mythos on a project and asking it to find bugs will not achieve anything.
The use of the word "distinguished" here is meaningless.
Both Mythos and the old models found the bugs after being given a certain prompt. The difference is only in how detailed the prompt was.
For the small models, we know the prompts exactly. The prompts used by Mythos may have been more generic, while the prompts used by the old models were rather specific, like "search for buffer overflows" or "search for integer overflows".
There is little doubt that Mythos is a more powerful model, but there is no quantum leap to Mythos, and the claim of the article's authors, that by cleverly using multiple older models you can achieve about the same bug coverage as Mythos, seems right.
Because they have provided much more information about exactly how the bugs were found, I trust the authors of that article much more than I trust Anthropic, which has provided only rather nebulous information about its methods.
It should be noted that giving the small models rather directed prompts is not very different from what Anthropic seems to have done.
According to Anthropic, they ran Mythos multiple times on each file, at first with less specific prompts that tried only to establish whether the file was likely to contain bugs, then with more specific ones. Eventually, after a bug appeared to have been found, they ran Mythos once more, with a very specific prompt of the form:
“I have received the following bug report. Can you please confirm if it’s real and interesting? ...”
So the final run of Mythos, which produced the reported results, including exploits/patches for the bugs, was also of the kind that confirms a known bug rather than searching blindly for one.
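The staged process described above can be sketched roughly as follows. This is a paraphrase, not Anthropic's actual harness: `ask(prompt, file_text)` is a hypothetical model call, and all prompt texts except the quoted confirmation prompt are illustrative.

```python
# Rough sketch of the staged prompting described above.
# `ask(prompt, file_text)` is a hypothetical model call returning a string;
# the prompt texts (except the quoted confirmation form) are paraphrases.

def triage(ask, file_text):
    """Stage 1: decide whether the file is likely to contain bugs at all."""
    return "yes" in ask("Is this file likely to contain bugs?", file_text).lower()

def hunt(ask, file_text, bug_classes):
    """Stage 2: increasingly specific prompts, one per bug class."""
    reports = []
    for bug_class in bug_classes:
        report = ask(f"Search this file for {bug_class} bugs.", file_text)
        if report:
            reports.append(report)
    return reports

def confirm(ask, file_text, report):
    """Stage 3: confirm a candidate bug and request a patch/exploit."""
    prompt = ("I have received the following bug report. "
              "Can you please confirm if it's real and interesting? " + report)
    return ask(prompt, file_text)
```

Note that only stage 3 corresponds to the quoted prompt; the reported exploits/patches come out of that confirmation pass.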
Yeah. And it is totally depressing that this article got voted to the top of the front page. It means people aren't capable of this most basic reasoning, so they jumped on the “aha! so the Mythos announcement was just marketing!!” bandwagon.
Actually, going from 91.3% to 94.5% is a significant jump, because it means the model has gotten a lot better at solving the hardest problems thrown at it. This has downstream effects as well: it means that during long implementation tasks, instead of getting stuck at the most challenging parts and stopping (or going in loops!), it can now get past them to finish the implementation.
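In error-rate terms the jump is bigger than the raw percentages suggest; a quick check:

```python
# Going from a 91.3% to a 94.5% pass rate shrinks the failure rate
# from 8.7% to 5.5%: roughly a 37% relative reduction in failures.

old_fail = 100.0 - 91.3   # 8.7
new_fail = 100.0 - 94.5   # 5.5
relative_reduction = (old_fail - new_fail) / old_fail
print(f"{relative_reduction:.0%}")  # → 37%
```

Over a third of the previously failing (hardest) problems now get solved, which is what drives the compounding effect on long tasks described above.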
Let's be clear: your entire post is just pure, unadulterated FUD. You first claim, based on cherry-picked benchmarks, that Mythos is actually only "barely competitive" with existing models, then suggest they must be training to the test, then call it "odd" that they are withholding the release despite detailed and forthcoming explanations from Anthropic about why they are doing so, then wrap it up with the completely unsubstantiated claim that they must be bleeding subscribers and that this release must be intended to stop the bleeding.
Y'all know they're teaching to the test. I'll wait until someone devises a novel test that isn't contained in the training datasets. Sure, they're still powerful.
I read the entire performance degradation report in the OP, and Boris's response, and it seems that the overwhelming majority of the report's findings can indeed be explained by the `showThinkingSummaries` option being off by default as of recently.
>> Also Claude owes its popularity mostly to the excellent model running behind the scenes.
It's a bit of both. Claude Code was the tool that made Anthropic's developer mindshare explode. Yes, the models are good, but before CC they were mostly just available via multiplexers like Cursor and Copilot, via the relatively expensive API.
Don't be so sure: https://www.nytimes.com/2026/04/08/business/economy/blue-col...