This is from the first of the caveats that they list:
> Scoped context: Our tests gave models the vulnerable function directly, often with contextual hints (e.g., "consider wraparound behavior"). A real autonomous discovery pipeline starts from a full codebase with no hints. The models' performance here is an upper bound on what they'd achieve in a fully autonomous scan. That said, a well-designed scaffold naturally produces this kind of scoped context through its targeting and iterative prompting stages, which is exactly what both AISLE's and Anthropic's systems do.
That's why their point is what the subheadline says, that the moat is the system, not the model.
Everybody so far here seems to be misunderstanding the point they are making.
If that's the point they are making, let's see their false positive rate that it produces on the entire codebase.
They measured false negatives on a handful of cases, but that is not enough to hint at the system you suggest. And based on my experiences with $$$ focused eval products that you can buy right now, e.g. greptile, the false positive rate will be so high that it won't be useful to do full codebase scans this way.
How do we know the false positives for this "Mythos" thingamabob? Since they didn't release it, and we cannot reproduce it, are we to simply believe their word on this? What if the author of the featured article simply made a claim about that? We also simply believe their word? To me these AI tech companies are not any more trustworthy than a random blog author, maybe even less so, due to all the shady stuff they are pulling and especially since they have not released. Show or it didn't happen.
I get what you're saying, but I think this is still missing something pretty critical.
The smaller models can recognize the bug when they're looking right at it, that seems to be verified. And with AISLE's approach you can iteratively feed the models one segment at a time cheaply. But if a bug spans multiple segments, the small model doesn't have the breadth of context to understand those segments in composite.
The advantage of the larger model is that it can retain more context and potentially find bugs that require more code context than one segment at a time.
That said, the bugs showcased in the mythos paper all seemed to be shallow bugs that start and end in a single input segment, which is why AISLE was able to find them. But having more context in the window theoretically puts less shallow bugs within range for the model.
I think the point they are making, that the model doesn't matter as much as the harness, stands for shallow bugs but not for vulnerability discovery in general.
OK, consider a for loop that goes through your repo, then goes through each file, and then goes through each common vulnerability...
Is Mythos some how more powerful than just a recursive foreloop aka, "agentic" review. You can run `open code run --command` with a tailored command for whatever vulnerabilities you're looking for.
newer models have larger context windows, and more stable reasoning across larger context windows.
If you point your model directly at the thing you want it to assess, and it doesn't have to gather any additional context you're not really testing those things at all.
Say you point kimi and opus at some code and give them an agentic looping harness with code review tools. They're going to start digging into the code gathering context by mapping out references and following leads.
If the bug is really shallow, the model is going to get everything it needs to find it right away, neither of them will have any advantage.
If the bug is deeper, requires a lot more code context, Opus is going to be able to hold onto a lot more information, and it's going to be a lot better at reasoning across all that information. That's a test that would actually compare the models directly.
Mythos is just a bigger model with a larger context window and, presumably, better prioritization and stronger attention mechanisms.
Harnesses are basically doing this better than just adding more context. Every time, REGARDLESS OF MODEL SIZE, you add context, you are increasing the odds the model will get confused about any set of thoughts. So context size is no longer some magic you just sprinkle on these things and they suddenly dont imagine things.
So, it's the old ML join: It's just a bunch of if statements. As others are pointing out, it's quite probably that the model isn't the thing doing the heavy lifting, it's the harness feeding the context. Which this link shows that small models are just as capabable.
Which means: Given a appropiately informed senior programmer and a day or two, I posit this is nothing more spectacular than a for loop invoking a smaller, free, local, LLM to find the same issues. It doesn't matter what you think about the complexity, because the "agentic" format can create a DAG that will be followable by a small model. All that context you're taking in makes oneshot inspections more probable, but much like how CPUs have go from 0-5 ghz, then stalled, so too has the context value.
Agent loops are going to do much the same with small models, mostly from the context poisoning that happens every time you add a token it raises the chance of false positives.
I know you're right that there's a saturation point for context size, but it's not just context size that the larger models have, it's better grounding within that as a result of stronger, more discriminative attention patterns.
I'm not saying you're not going to drive confusion by overloading context, but the number of tokens required to trigger that failure mode in opus is going to be a lot higher than the number for gpt-oss-20b.
I'm pretty sure a model that can run on a cellphone is going to cap out it's context window long before opus or mythos would hit the point of diminishing returns on context overload. I think using a lower quality model with far fewer / noisier weights and less precise attention is going to drive false positives way before adding context to a SOTA model will.
You can even see here, AISLE had to print a retraction because someone checked their work and found that just pointing gpt-oss-20b at the patched version generated FP consistently: https://x.com/ChaseBrowe32432/status/2041953028027379806
To clarify, I don't necessarily agree with the post or their approach. I just thought folks were misreading it. I also think it adds something useful to the conversation.
> That's why their point is what the subheadline says, that the moat is the system, not the model.
I'm skeptical; they provided a tiny piece of code and a hint to the possible problem, and their system found the bug using a small model.
That is hardly useful, is it? In order to get the same result , they had to know both where the bug is and what the bug is.
All these companies in the business of "reselling tokens, but with a markup" aren't going to last long. The only strategy is "get bought out and cash out before the bubble pops".
You can imagine a pipeline that looks at individual source files or functions. And first "extracts" what is going on. You ask the model:
- "Is the code doing arithmetic in this file/function?"
- "Is the code allocating and freeing memory in this file/function?"
- "Is the code the code doing X/Y/Z? etc etc"
For each question, you design the follow-up vulnerability searchers.
For a function you see doing arithmetic, you ask:
- "Does this code look like integer overflow could take place?",
For memory:
- "Do all the pointers end up being freed?"
_or_
- "Do all pointers only get freed once?"
I think that's the harness part in terms of generating the "bug reports". From there on, you'll need a bunch of tools for the model to interact with the code. I'd imagine you'll want to build a harness/template for the file/code/function to be loaded into, and executed under ASAN.
If you have an agent that thinks it found a bug: "Yes file xyz looks like it could have integer overflow in function abc at line 123, because...", you force another agent to load it in the harness under ASAN and call it. If ASAN reports a bug, great, you can move the bug to the next stage, some sort of taint analysis or reach-ability analysis.
So at this point you're running a pipeline to:
1) Extract "what this code does" at the file, function or even line level.
2) Put code you suspect of being vulnerable in a harness to verify agent output.
3) Put code you confirmed is vulnerable into a queue to perform taint analysis on, to see if it can be reached by attackers.
Traditionally, I guess a fuzzer approached this from 3 -> 2, and there was no "stage 1". Because LLMs "understand" code, you can invert this system, and work if up from "understanding", i.e. approach it from the other side. You ask, given this code, is there a bug, and if so can we reach it?, instead of asking: given this public interface and a bunch of data we can stuff in it, does something happen we consider exploitable?
That's funny, this is how I've been doing security testing in my code for a while now, minus the 'taint analysis'. Who knew I was ahead of the game. :P
In all seriousness though, it scares me that a lot of security-focused people seemingly haven't learned how LLMs work best for this stuff already.
You should always be breaking your code down into testable chunks, with sets of directions about how to chunk them and what to do with those chunks. Anyone just vaguely gesturing at their entire repo going, "find the security vulns" is not a serious dev/tester; we wouldn't accept that approach in manual secure coding processes/ SSDLCs.
In a large codebase there will still be bugs in how these components interoperate with each other, bugs involving complex chaining of api logic or a temporal element. These are the kind of bugs fuzzers generally struggle at finding. I would be a little freaked out if LLMs started to get good at finding these. Everything I've seen so far seems similar to fuzzer finds.
I think there is already papers and presentations on integrating these kind of iterative code understanding/verificaiton loops in harnesses. There may be some advantages over fuzzing alone. But I think the cost-benefit analysis is a lot more mixed/complex than anthropic would like people to believe. Sure you need human engineers but it's not like insurmountably hard for a non-expert to figure out
Tunnel vision? If your model can handle big context, why divide into lesser problems to conquer - even if such splitting might be quite trivial and obvious?
It's the difference of "achieve the goal", and "achieve the goal in this one particular way" (leverage large context).
I meant, if the claim here is that small models can accomplish the same things with good scaffolding, why didn’t they demonstrate finding those problem with good scaffolding rather than directly pointing them at the problem?
Lot of people in this thread don't seem to be getting that.
If another model can find the vulnerability if you point it at the right place, it would also find the vulnerability if you scanned each place individually.
People are talking about false positives, but that also doesn't matter. Again, they're not thinking it through.
False positives don't matter, as you can just automatically try and exploit the "exploit" and if it doesn't work, it's a false positive.
Worse, we have no idea how Mythos actually worked, it could have done the process I've outlined above, "found" 1,000s of false positives and just got rid of them by checking them.
The fundamental point is it doesn't matter how the cheap models identified the exploit, it's that they can identify the exploit.
When it turns out the harness is just acting as a glorified for-each brute force, it's not the model being intelligent, it's simply the harness covering more ground. It's millions of monkeys bashing type-writers, not Shakespeare at one.
It’s strange to see this constant “I could do that too, I just don’t want to” response.
Finding an important decades-old vulnerability in OpenBSD is extremely impressive. That’s the sort of thing anyone would be proud to put on their resume. Small models are available for anyone to use. Scaffolding isn’t that hard to build. So why didn’t someone use this technique to find this vulnerability and make some headlines before Anthropic did? Either this technique with small models doesn’t actually work, or it does work but nobody’s out there trying it for some reason. I find the second possibility a lot less plausible than the first.
From the article:
>At AISLE, we've been running a discovery and remediation system against live targets since mid-2025: 15 CVEs in OpenSSL (including 12 out of 12 in a single security release, with bugs dating back 25+ years and a CVSS 9.8 Critical), 5 CVEs in curl, over 180 externally validated CVEs across 30+ projects spanning deep infrastructure, cryptography, middleware, and the application layer.
They have been doing it (and likely others as well), but they are not anthropic which a million dollar marketing budget and a trillion dollar hype behind it, so you just didn't hear about it.
> If another model can find the vulnerability if you point it at the right place, it would also find the vulnerability if you scanned each place individually.
They didn't just point it at the right place, they pointed it at the right place and gave it hints. That's a huge difference, even for humans.
> That said, a well-designed scaffold naturally produces this kind of scoped context through its targeting and iterative prompting stages, which is exactly what both AISLE's and Anthropic's systems do.
Unless the context they added to get the small model to find it was generated fully by their own scaffold (which I assume it was not, since they'd have bragged about it if it was), either they're admitting theirs isn't well designed, or they're outright lying.
People aren't missing the point, they're saying the point is dishonest.
Your comment is a great addition to the conversation. I do just want to add the scale point, though. That source is less than a million was bet, and now there are hundreds of millions bet on the ceasefire, and millions more on other Iran bets, and then even more bet on Russia/Ukraine.
So I think it's a bit overstated to frame it as nothing is new. But your historical context is helpful.
> That source is less than a million was bet, and now there are hundreds of millions bet on the ceasefire, and millions more on other Iran bets, and then even more bet on Russia/Ukraine.
I'd imagine those numbers are typical for any transaction facilitated by the Internet comparing 2003 to 2026.
It seems to me there's an explosion in just the last few years that's noteworthy -- that gambling on war did not simply grow hand-in-hand with all web purchases. Am happy to be wrong, but that's my experience seeing this play out.
> 1. Give massive preferential treatment to women during university
"Some schools are trying to attract male applicants by improving their sports programs; others invest more heavily in buying boys’ email addresses or give incentives to boys that they do not offer to girls — such as free stickers or baseball caps — for filling out information on the school website. Marketing materials are sometimes designed to speak specifically to young men. Heath Einstein, dean of admission of Texas Christian University in Fort Worth, recalls fondly a pamphlet known internally as the “bro-chure,” which featured shots of the football team, a rock climber and a male student shoving cake in his mouth. But the easiest way for many competitive schools to fix their gender ratios lies in the selection process, at which point admissions officers often informally privilege male applicants, a tendency that critics say amounts to affirmative action for men."
It's not the NYT making the claims, they are merely reporting on them. And I think it's weird to just dismiss all evidence that conflicts with your priors. But fine, how about the NY Post:
Or how about this quote from a former admissions officer at Brandeis:
"We might admit the male student and wait-list the female student because of wanting to get closer to this sort of gender parity in terms of percentages in the class."
For anybody who watched the new chess documentary on Netflix about the Magnus vs Hans drama, you'll remember how pissed Magnus and his dad are at chess.com, which is what makes this partnership interesting to me. (Magnus is a TTT cofounder)
There are important factual differences compared to the challenges against OpenAI, but I think yes this decision does ultimately offer them some new legal protection against whatever customers decide to do with their tools.
That's a good point too and one I wasn't even trying to make. I was thinking more in terms of how the big LLM players trained their models using torrents etc.
I have no interest in anything crypto, but they are making a proposal about NFTs tied to AI (LLMs and verifiable machine learning) so they can make ownership decisions.
So it'd be alive in the making decisions sense, not in a "the technology is thriving" sense.
"Polymarket condemns the harassment and threats directed at Emanuel Fabian, or anyone else for that matter. This behavior violates our terms of service and has no place on our platform or anywhere else. Prediction markets depend on the integrity of independent reporting. Attempts to pressure journalists to alter their reporting undermine that integrity and undermine the markets themselves."
It's a little light on rigor. There's a lot of "X happened before Y therefore X caused Y" but I think it's a really good articulation of some of the problems it tackles.
Yes in sports. NCAA is trying to specifically get player prop bets banned in part because they are more likely not just to be corrupt, but also they specifically mention player safety.
I linked this elsewhere, but I found Nate Silver's take interesting here. He specifically takes on the problem of player prop bets and says they produce abusive behavior toward players.
I think it's reasonable to research if there's some equivalent to player prop bets in other contexts, like anything that might hinge on what a reporter writes. There's the safety angle and there's also just the "more likely to be corrupted" angle. Like I said in my original comment, I'd just like them to evaluate this.
Related to your comment, see Nate Silver's take on the NBA gambling scandal.
"Player props are inherently more subject to manipulation because only one player needs to be involved in rigging the outcome. [...] Even as someone more sympathetic to gambling than most people you’re probably reading on this story, I’m not sure I’d really care if player prop bets were banned entirely. They produce abusive behavior toward players. They also often put team and individual performance into tension with one another..."
> Scoped context: Our tests gave models the vulnerable function directly, often with contextual hints (e.g., "consider wraparound behavior"). A real autonomous discovery pipeline starts from a full codebase with no hints. The models' performance here is an upper bound on what they'd achieve in a fully autonomous scan. That said, a well-designed scaffold naturally produces this kind of scoped context through its targeting and iterative prompting stages, which is exactly what both AISLE's and Anthropic's systems do.
That's why their point is what the subheadline says, that the moat is the system, not the model.
Everybody so far here seems to be misunderstanding the point they are making.
reply