This helps, but the original prompt is still there: the system prompt still influences these thinking blocks; they just don't end up clogging your context. The system prompt sits at the very top of the context hierarchy, and even with isolated "thinking" blocks, the reasoning tokens are still autoregressively conditioned on the system instructions. If the system prompt forces "caveman speak", the model's attention is immediately biased toward simpler, less coherent regions of its latent space. You are handicapping the vocabulary and syntax it uses inside its own thinking process, which directly throttles its ability to execute high-level logic.
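To make the conditioning concrete, here is a toy sketch; the message shape is a generic chat format, not any particular vendor's API:

    # Autoregressive decoding conditions every token on everything to its
    # left, so each thinking token t_i is sampled from
    #   P(t_i | system_prompt, user_message, t_1 .. t_{i-1})
    # "Hiding" the thinking block from the user changes what you see,
    # not what the model conditioned on.
    context = [
        {"role": "system", "content": "You speak only in caveman speak."},
        {"role": "user", "content": "Plan the database migration."},
        # thinking tokens are generated here, *after* (and attending to)
        # both messages above, style constraint included
    ]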
I get your point, but it seems that extended thinking is driven by a hidden system prompt that is not strongly affected by the style the user defines. Probably it's somewhere in between.
That is part of it. They are also trained to think in very well-mapped regions of the model: all the RLHF and similar tuning is done on their CoT and on user feedback about responses.
That is not how CoT works. It is all in context. All influenced by context. This is a common and significant misunderstanding of autoregressive models and I see it on HN a lot.
That "unproven claim" is actually a well-established concept called Chain of Thought (CoT). LLMs literally use intermediate tokens to "think" through problems step by step. They have to generate tokens to talk to themselves, debug, and plan. Forcing them to skip that process by cutting tokens, like making them talk in caveman speak, directly restricts their ability to reason.
Agents can simply be told to write code in a functional style, and they won't complain. Think of it like a constraint system or a proof system: the agent can reason better about the code and its side effects.
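A minimal illustration of why pure functions are easier to reason about (names are made up for the example):

    from dataclasses import dataclass, replace

    @dataclass
    class MutableAccount:
        balance: int

    # Imperative: a hidden effect. Correctness depends on call order and on
    # every other reference to `account`, which the agent must track.
    def deposit_mut(account: MutableAccount, amount: int) -> None:
        account.balance += amount

    @dataclass(frozen=True)
    class Account:
        balance: int

    # Functional: no hidden state. The whole behavior is inputs -> output,
    # which is exactly the shape constraint checkers and proofs like.
    def deposit(account: Account, amount: int) -> Account:
        return replace(account, balance=account.balance + amount)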
Agents are very good at following and validating constraints, and at hill climbing. This makes sense to me. Humans benefit too, but it is hard to get a bunch of humans to follow a style and maintain it over time.
Agents are useful because they don't inherit context from their parent. They're basically "compaction" at a small scale. They succeed because context pollution creates greater indeterminacy, and a fresh context avoids it. The fact that you can spin up many of them is not their primary benefit.
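Roughly, in a sketch (helper names are hypothetical):

    def complete(messages: list[dict]) -> str:
        """Stub for whatever model call you use."""
        raise NotImplementedError

    def run_subagent(task: str) -> str:
        # The child starts from a fresh context: only the task brief, none
        # of the parent's accumulated transcript. The isolation, not the
        # parallelism, is what buys back determinism.
        fresh_context = [{"role": "user", "content": task}]
        return complete(fresh_context)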
There will always be more bugs than we can fix. AI can patch them as well, but if your system is difficult to test and lacks rigorous validation, you will likely get an unacceptable number of regressions.
There is something powerful about environment and what it does to our minds. For the author, giving up the monitor is totally valid and may work for many people. I can often convince myself to change a habit by adding a simple extra physical step; this is harder on a computer. In some roles it takes discipline not to end up with dozens of windows and even more browser tabs. I just aggressively close windows when starting a new task or a new line of thinking. Most likely you don't need whatever you are closing :)
Forcing short responses will hurt reasoning and chain of thought. There are some potential benefits, but constraining response length and timing ironically increases the odds of hallucination: if the model prioritizes getting the answer out, it loses the tokens it needed to reason with and to validate the response. It is generally trained to reason across multiple lines, and it uses English as its sole thinking and reasoning system.
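One workaround if you genuinely need short outputs: cap the answer, not the generation. Again `ask` is a hypothetical stand-in and the prompt wording is illustrative:

    def ask(prompt: str) -> str:
        """Hypothetical helper wrapping your LLM call."""
        raise NotImplementedError

    # Risky: the length cap squeezes out the reasoning along with the prose.
    risky = ask("In one sentence: which index should we add to this table?")

    # Safer: spend tokens on reasoning first, constrain only the final line.
    safer = ask(
        "Think through the query patterns step by step. "
        "Then give a one-sentence recommendation on the final line."
    )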
Nothing on that page indicates otherwise.