I would doubt it. They are mostly trained on natural language. They may be getting some visual reasoning capability from multi-modal training on video, but their reasoning doesn't seem to generalize much from one domain to another.
Some future AGI, not LLM based, that learns from its own experience via sensory feedback (and has non-symbolic feedback paths) would presumably at least learn some non-symbolic reasoning, however effective that may be.
My argument for this is mostly that we don't use language for all forms of reasoning; we are likely doing some of it over internal representations or embeddings. Animals also demonstrate the ability to reason about situations without having a language at all.
I see language more as a protocol for inter-agent communication (including human-human communication) but it contains a lot of inefficiencies and historical baggage and is not necessarily the optimal representation of ideas within a brain.
This "figuring out" is just going to come from stuff it was trained on - people discussing why LLMs fail at certain things, and those people (training samples) not always being correct about it!
The "How many R's in 'strawberry', counting words in a sentence, reversing strings: I process text as tokens, not characters, so these are surprisingly error-prone" explanation sounds plausible, but I don't think it is correct.
Any model I've ever tried that failed on things like "R's in strawberry" was quite capable of reliably returning the letter sequence of the word, so the mapping of tokens back to letters is not the issue. That should also be obvious from models' ability to do things like mapping between ASCII and Base64 (6 bits per Base64 character, so 4 Base64 characters encode 3 ASCII bytes). This is just sequence-to-sequence prediction, which is something LLMs excel at - their core competency!
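To make the Base64 bit arithmetic concrete, here's a minimal sketch using Python's standard library: 3 ASCII bytes (24 bits) always map to exactly 4 Base64 characters of 6 bits each.

```python
import base64

# 3 input bytes = 24 bits = 4 Base64 characters of 6 bits each.
text = b"Cat"
encoded = base64.b64encode(text).decode("ascii")
print(encoded)  # Q2F0  (3 bytes in, 4 characters out)
```

Reliably doing this mapping by hand requires exactly the kind of character-level awareness that tokenization supposedly destroys, yet models manage it.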
I think the actual reason for failures at these types of counting and reversing tasks is twofold:
1) These algorithmic tasks require step-by-step decomposition and a variable amount of compute, so they are not amenable to a direct response from an LLM (a fixed ~100 layers of compute). Asking it to plan and complete the task in step-by-step fashion (where, for example, it can now take advantage of its ability to generate the letter sequence before reversing or counting it) is going to be much more successful. A thinking model may do this automatically without needing to be told to do it.
2) These types of task, requiring accurate reference and sequencing through positions in its context, are just not natural for an LLM, and it is probably not doing them (without specific prompting) in the way you imagine. Say you ask it to reverse the letter sequence of a 10-letter word, and it has somehow managed to generate letter #10, the last letter of the word, and now needs to copy letter #9 to the output. It will presumably have learnt that 10-1 is 9, but how does it use that to access the appropriate position in context (or worse, if you didn't ask it to go step by step and first generate the letter sequence, the sequence doesn't even exist in context)? The letter sequence may have quotes and/or commas or spaces in it, and altogether starts at some offset in the context, so it's far more difficult than just copying the token at context position #9! It's probably not even actually using context positions to do this, at least not in this way. You can make tasks like this much easier for the model by telling it exactly how to perform them, having it generate step-by-step intermediate outputs to track its progress, etc.
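Written as ordinary code, the step-by-step decomposition described above looks trivially easy, which is exactly the point: the loop needs a number of steps proportional to the word length and an explicit index to track progress, neither of which a fixed-depth single forward pass provides. A sketch:

```python
def reverse_word_stepwise(word: str) -> str:
    # Step 1: materialize the letter sequence explicitly
    # (the equivalent of asking the model to spell the word out first).
    letters = list(word)
    # Step 2: walk the sequence from the end, one position at a time,
    # tracking progress with an explicit index. Variable compute:
    # the number of steps grows with the length of the word.
    out = []
    for i in range(len(letters) - 1, -1, -1):
        out.append(letters[i])
    return "".join(out)

print(reverse_word_stepwise("strawberry"))  # yrrebwarts
```

Prompting the model to emit these intermediate steps in its output effectively gives it the scratch state (the spelled-out sequence, the running index) that this code keeps in variables.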
BTW, note that the model itself has no knowledge of, or insight into, the tokenization scheme being used with it, other than what is available on the web or what it might have been trained to know. In fact, if you ask a strong model how it could even in theory figure out (by experimentation) its own tokenization scheme, it will realize this is next to impossible. The best hope might be some sort of statistical analysis of its own output, hoping to take advantage of the fact that it is generating sub-word token probabilities, not word probabilities. Sonnet 4.6's conclusion was "Without logprob access, the model almost certainly cannot recover its exact tokenization scheme through introspection or behavioral self-probing alone".
It's interesting to see Opus 4.7 follow so soon after the announcement of Mythos, especially given that Anthropic are apparently capacity constrained.
Capacity is shared between model training (pre & post) and inference, so it's hard to see Anthropic deciding that it made sense, while capacity constrained, to train two frontier models at the same time...
I'm guessing that this means that Mythos is not a whole new model separate from Opus 4.6 and 4.7, but is rather based on one of these with additional RL post-training for hacking (security vulnerability exploitation).
The alternative would be that Mythos is based on an early snapshot of their next major base model, and then presumably Opus 4.7 is just Opus 4.6 with some additional post-training (as may be the case anyway).
It seems a lot of the problem isn't "token shrinkage" (reducing plan limits), but rather changes they made to prompt caching - things that used to be cached for 1 hour now only being cached for 5 min.
Coding agents rely on prompt caching to avoid burning through tokens - they go to lengths to try to keep context/prompt prefixes constant (arranging non-changing stuff like tool definitions and file content first, variable stuff like new instructions following that) so that prompt caching gets used.
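A sketch of what that prefix ordering looks like in practice. This is an illustrative request shape, not any vendor's exact API: the point is only that the stable parts come first and byte-identically, so the cache can match on the prefix, while anything that changes per call goes last.

```python
# Hypothetical agent request builder: stable content first (cacheable
# prefix), variable content last, so prompt caching can do its job.
def build_request(tool_defs, system_prompt, file_context, new_instruction):
    return {
        "tools": tool_defs,  # stable across calls: part of the cached prefix
        "messages": [
            {"role": "system", "content": system_prompt},  # stable
            {"role": "user", "content": file_context},     # stable until a file changes
            {"role": "user", "content": new_instruction},  # varies every call
        ],
    }
```

Reordering any of the stable parts (or letting them vary between calls) breaks the prefix match and forces the full prompt to be reprocessed.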
This change to a new tokenizer that generates up to 35% more tokens for the same text input is wild - going to really increase token usage for large text inputs like code.
AFAIK the way caching works is at API key level, which will be shared across the main/parent agent and all subagents.
Note that the model API is stateless - there is no connection being held open for the lifetime of any agent/subagent, so the model has no idea how long any client-side entity is running for. All the model sees over time is a bunch of requests (coming from mixture of parent and subagents) all using the same API key, and therefore eligible to use any of the cached prompt prefixes being maintained for that API key.
Things like subagent tool registration are going to remain the same across all invocations of the subagent, so those would come from cache as long as the cache TTL is long enough.
If you have OCD then do not look at those pictures!
I find these annoying. I guess they are going for organic/realistic rather than too perfect, but every other aspect of the photos - the aesthetically melting cheese, etc - follows the norms of fake fast food photography, so why bother?
I'm not sure I'd call it a "trick", but since A ^ 0 = A, and B ^ B = 0, then ((A ^ B) ^ B) = A. i.e. XOR-ing any number by the same number twice gets you back the original number.
This used to be used back in the day for cheap and nasty computer graphics, since it means that if you draw to the screen by XOR-ing with the pixels already on the screen then you can undo it, restoring the background, by doing it a second time. The "nasty" part is that XORing with what's already on the screen isn't going to look great, but for something like a rotating wire-frame figure it might be OK.
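The draw/undo trick above can be shown in a few lines. A toy "framebuffer" of pixel bitmasks stands in for the screen:

```python
# XOR drawing: drawing the same sprite twice restores the background,
# because (bg ^ s) ^ s == bg for any values of bg and s.
background = [0b1010, 0b0110, 0b1111, 0b0001]
sprite     = [0b1100, 0b1100, 0b0000, 0b0011]

# Draw: XOR the sprite onto the screen.
screen = [bg ^ s for bg, s in zip(background, sprite)]
# Undo: XOR the same sprite on again - the background comes back exactly.
restored = [px ^ s for px, s in zip(screen, sprite)]

assert restored == background
```

No saved copy of the background is needed, which is what made this cheap on memory-starved hardware.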
LLM output doesn't have the variety of human output, since they operate in fixed fashion - statistical inference followed by formulaic sampling.
Additionally, the statistics used by LLMs are going to be similar across different LLMs, since at scale it's just "the statistics of the internet".
Human output has much more variety, partly because we're individuals with our own reading/writing histories (which we're drawing upon when writing), and partly because we're not so formulaic in the way we generate. Individuals have their own writing styles and vocabulary, and one can identify specific authors to a reasonable degree of accuracy based on this.
It's a bit like detecting cheating in a chess tournament. If an unusually high percentage of a player's moves are optimal computer moves, then there is a high likelihood that they were computer generated. Computers and humans don't pick moves in the same way, and humans don't have the computational power to always find "optimal" moves.
Similarly with the "AI detectors" used to detect if kids are using AI to write their homework essays, or to detect if blog posts are AI generated ... if an unusually high percentage of words are predictable by what came before (the way LLMs work), and if those statistics match that of an LLM, then there is an extremely high chance that it was written by an LLM.
Can you ever be 100% sure? Maybe not, but in reality human-written text is never going to have such statistical regularity, such an LLM statistical signature, that an AI detector gives it more than 10-20% confidence of being AI. So when the detector says it's 80%+ confident something was AI generated, that effectively means 100%. There is of course also content that is part human, part AI (a human used an LLM to fix up their writing), which may score somewhere in the middle.
> LLM output doesn't have the variety of human output, since they operate in fixed fashion - statistical inference followed by formulaic sampling.
This is the wrong thing to look at; your chess analogy is much stronger, and the detection method is similar (if you can figure out a prompt that generates something close to the content, it almost certainly isn't of human origin).
But as to why the thing I'm quoting doesn't work: if you took, say, web comic author Darren Gav Bleuel, put him in a sci-fi mass duplication incident that makes 950 million copies of him, and had them all talking and writing all over the internet, people would very quickly learn to recognise the style, which would have very little variety because they'd all be forks of the same person.
Indeed, LLMs are very good at presenting other styles than their defaults, better at this than most humans, and what gives away LLMs is that (1) very few people bother to ask them to act other than their defaults, and (2) all the different models, being trained in similar ways on similar data with similar architectures, are inherently similar to each other.
An LLM is just a computer function that predicts the next word based on the input you give it. It doesn't make any difference what the input is (e.g. "please respond in style X") - the function doesn't change, and the statistical signature of how it works will still be there.
If you don't believe me, try it for yourself. Ask an AI to generate some text and give it to the AI detector below (paste your text, then click on scan). Now ask the AI to generate in a different style and see if it causes the detector to fail.
I can't use that linked app - it paywalls immediately. Unlike the person you were replying to here[0], I do not claim that this is impossible:
An LLM is indeed just a computer function that does stats. And our brains are just electro-chemistry that does stats. This is why stylometric analysis of human writing is a thing.
My previous experience with tools such as the one you linked is that they used to be quite poor. I assume they're better since then, but then again so are the models.
> I assume they're better since then, but then again so are the models.
Yes, but "better" means different things for each of these.
Detectors are trying to get better at distinguishing human from LLM-generated text.
LLMs are being improved to generate more useful (and benchmark maxxing) outputs, not to attempt to avoid detection.
LLMs are in fact explicitly trained to be as predictable as possible. The training goal is to minimize continuation prediction errors, which means they are in effect being trained to generate output where each word can be predicted by what came before it (which we can contrast to a human who tries to spice it up and keep it interesting by not being too predictable!).
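The "predictability" being measured here is essentially perplexity: the exponential of the average negative log-probability per token, which is exactly the quantity LLM training minimizes. A toy sketch with made-up per-token probabilities (the numbers are illustrative, not from any real model):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

# Hypothetical probabilities a language model might assign to each next token:
llm_like_text   = [0.9, 0.8, 0.95, 0.85, 0.9]   # highly predictable continuations
human_like_text = [0.4, 0.1, 0.6, 0.05, 0.3]    # more surprising word choices

print(perplexity(llm_like_text))    # low
print(perplexity(human_like_text))  # much higher
```

A detector scoring text against a reference model is, roughly, flagging passages whose perplexity is suspiciously low for a human author.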
RL post-training, which is especially used for computer code and math, is going to change this word-by-word predictability (detectability) a bit since the focus is now on a longer term goal rather than next word, but to some extent you could also view it as just steering/narrowing the output of the model towards that goal, not totally overriding the next-word statistics.
I don't know if there are AI detectors specifically trained to detect AI code rather than prose, but I'd expect that is more difficult to do, both because of the RL factor, and because computer code is so predictable in the first place - adhering to rigid syntax etc.
A human can easily produce output that looks like anything an LLM can produce, therefore an LLM detector that can say "this is 100% written by AI" cannot exist. It's really that simple.
> Can you ever be 100% sure? Maybe not
The commenter I was replying to claimed exactly this. Their AI detector showed that the text was "100%" AI generated.
I was just expressing some caution. Saying you are 100% certain of anything when the evidence is statistical seems a bit too certain, especially if it was just from a short text sample.
Compare to flipping a coin, counting heads vs tails, and trying to assess if it's a fair or biased coin. After 1000 flips if it's not close to 50/50 you would rightfully be suspect, and if it was 10/90 you should be almost certain it's biased. But you can never be 100% sure.
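The coin example can be made exact with the binomial distribution. Seeing at most 10 heads in 100 flips of a fair coin has probability on the order of 10^-17: overwhelming evidence of bias, but still never literally zero.

```python
from math import comb

def prob_at_most_k_heads(n: int, k: int) -> float:
    """P(X <= k) for a fair coin flipped n times (binomial lower tail)."""
    return sum(comb(n, i) for i in range(k + 1)) / 2**n

# 10 or fewer heads in 100 flips of a truly fair coin:
p = prob_at_most_k_heads(100, 10)
print(p)  # astronomically small, but > 0 - "almost certain", never 100%
```

That's the same situation the detector is in: it can report overwhelming statistical evidence, not logical certainty.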
Bottom line is that MCP doesn't change anything in the way the model discovers and invokes tools, so MCP doesn't help with the issue of lack of standard tool call syntax.
1) The way basic non-MCP tool use works is that the client (e.g. agent) registers (advertises) the tools it wants to make available to the model by sending an appropriate chunk of JSON to the model as part of every request (since the model is stateless), and if the model wants to use the tool then it'll generate a corresponding tool call chunk of JSON in the output.
2) For built-in tools like web_search the actual implementation of the tool will be done server-side before the response is sent back to the client. The server sees the tool invocation JSON in the response, calls the tool and replaces the tool call JSON with the tool output before sending the updated response back to the client.
3) For non-built-in tools such as the edit tool provided by a coding agent, the tool invocation JSON will not be intercepted server-side, and is instead just returned as-is to the client (agent) as part of the response. The client now has the responsibility of recognizing these tool invocations and replacing the invocation JSON with the tool output, the same as the server would have done for built-in tools. The actual "tool call" can be implemented by the client however it likes - either internally within the client or by calling some external API.
4) MCP tools work exactly the same as other client-provided tools, aside from how the client learns about them, and implements them if the model chooses to use them. This all happens client side, with the server/model unaware that these client tools are different from any others it is offering. The same JSON tool registration and JSON tool call syntax will be used.
What happens is that client configuration tells it what MCP servers to support, and as part of client initialization the client calls each MCP server to ask what tools it is providing. The client then advertises/registers these MCP tools it has "discovered" to the model in the normal way. When the client receives a tool call in the model response and sees that it is an MCP provided tool, then it knows it has to make an MCP call to the MCP server to execute the tool call.
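To make the flow concrete, here are illustrative shapes for the three JSON payloads involved. These are schematic, not any vendor's exact wire format (each model API has its own syntax, which is exactly the point of the comment above), and the tool name and fields are hypothetical:

```python
# 1) Client registers/advertises a tool with every request (model is stateless).
#    MCP-discovered tools are advertised exactly the same way as native ones.
tool_registration = {
    "name": "read_file",
    "description": "Read a file from the workspace",
    "parameters": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}

# 2) Model emits a tool call in its response; the client executes it,
#    either internally or by forwarding to the MCP server that owns the tool.
tool_call = {"name": "read_file", "arguments": {"path": "src/main.py"}}

# 3) Client substitutes the tool's output for the call and continues the exchange.
tool_result = {"tool": "read_file", "output": "<file contents here>"}
```

The MCP server only ever sees steps discovery and execution; the model only ever sees the registration and call JSON in its own native syntax.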
TL;DR:
- the client/agent talks standard MCP protocol to the MCP servers
- the client/agent talks the model-specific tool-use protocol to the model
There's a lot of wasted compute with stateless inference. We set out to solve that with a new computational model for transformers, and only process the delta between requests. That's how we achieve crazy low latency tool calling with LayerScale. Check it out at https://layerscale.ai - technical whitepapers out next month!
They have changed default CC effort to xhigh.
They have said that Opus 4.7 will generate more tokens than 4.6 at the same effort level.
They have increased their image input resolution meaning more tokens per image.
etc.
Maybe they are also extracting another 5% tokens from you by prompting it to not talk like a caveman, but that would hardly be noticeable.