Salut Christophe! Yes, I’ve come across the concept :) In fact, I think what we did with the ├── and └── notation is already a step in that direction (at least concept-wise) as it also puts a specific structure over the instructions.
But stretching all the way seems worth exploring too!
I think there's a chance we could squeeze out a better benchmark score, although there's a risk of overfitting, which I wanted to avoid.
The simplest test would be to make previously “unreachable” tasks succeed through obvious prompt tweaks — like reordering instructions or emphasizing key parts.
That said, my methodology intentionally avoided exposing the model to actual tasks. Instead, I focused on the domain as a whole: refining the instructions so a smaller model could understand and act reliably.
Great point! However, I’d ask the following: isn't faithfully following nuanced instructions an _agentic capability_ by itself?
If a model only performs well once the rules are clarified, that’s still revealing something important about its agency: it’s brittle when policies are ambiguous, but much stronger when they’re structured.
I agree with you that there’s a fine line between genuinely helping the model 'understand' the task and just 'teaching to the test'.
That said, Tau² is framed as a very specific use case — and we showed it can be solved more reliably. At the end of the day, that means we now have an agent built on a cheaper, faster model that still performs its job with higher reliability.
I only had Claude rewrite the domain policies and generic instructions, not the individual task statements. I updated the blog with a link showing the exact changes.
So no leakage — it wasn’t solving or hinting at any of the specific test cases, since none of the tasks were ever exposed to it.
Great point. Indeed, my methodology was to treat the prompt refactoring as a one-off task, so I didn't care much about cost/latency.
As for having GPT-5-mini do the rewriting — that’s a really interesting idea. I think the biggest challenge is avoiding cognitive overload. The Tau² agent policies are pretty complex: it’s easy to grasp the overall task, but the detailed rules for each user case aren’t always obvious.
I'm not sure how easy it is to actually overload GPT-5-mini, so that's definitely worth exploring.
Yea, so that part I actually did not overthink - I knew I needed strong reasoning, so I just grabbed Opus, which is my personal go-to for such tasks, and stuck with it since I wanted to avoid too many moving parts.
It would be interesting to compare both the benchmark results and the way other models approach the whole refactoring process!
Thanks for the feedback, appreciate it!
That makes a lot of sense - I'll update the article with links to the actual prompts.
Initially I thought these would be too lengthy for the article and no one would care, but it seems people are really interested. Of course I'd be happy to share the details.
I see that you've added links to a pull request that show the previous and final optimized prompts. However, the OP was asking for the prompt you gave to Claude to assist you in optimizing your prompt. Would you mind sharing that one? (That way nobody has to reverse engineer the instructions from the diff you provided.)