Salut Christophe! Yes, I’ve come across the concept :) In fact, I think what we did with the ├── and └── notation is already a step in that direction (at least concept-wise) as it also puts a specific structure over the instructions.
But stretching all the way seems worth exploring too!
I think there's a chance we could squeeze out a better benchmark score, although there's a risk of overfitting, which I wanted to avoid.
The simplest test would be to make previously “unreachable” tasks succeed through obvious prompt tweaks — like reordering instructions or emphasizing key parts.
That said, my methodology intentionally avoided exposing the model to actual tasks. Instead, I focused on the domain as a whole: refining the instructions so a smaller model could understand and act reliably.
Great point! However, I’d ask the following: isn't faithfully following nuanced instructions an _agentic capability_ by itself?
If a model only performs well once the rules are clarified, that’s still revealing something important about its agency: it’s brittle when policies are ambiguous, but much stronger when they’re structured.
I agree with you that there’s a fine line between genuinely helping the model 'understand' the task and just 'teaching to the test'.
That said, Tau² is framed as a very specific use case — and we showed it can be solved more reliably. At the end of the day, that means we now have an agent built on a cheaper, faster model that still performs its job with higher reliability.
I only had Claude rewrite the domain policies and generic instructions, not the individual task statements. I updated the blog with a link showing the exact changes.
So no leakage — it wasn’t solving or hinting at any of the specific test cases, since none of the tasks were ever exposed to it.
Great point. Indeed, my methodology was to treat the prompt refactoring as a one-off task, so I didn't care much about cost/latency.
As for having GPT-5-mini do the rewriting — that’s a really interesting idea. I think the biggest challenge is avoiding cognitive overload. The Tau² agent policies are pretty complex: it’s easy to grasp the overall task, but the detailed rules for each user case aren’t always obvious.
I'm not sure how easy it is to actually overload GPT-5-mini, so that's definitely worth exploring.
Yea, so that part I actually did not overthink - I knew I needed strong reasoning, so I just grabbed Opus, which is my personal go-to for such tasks, and stuck with it since I wanted to avoid too many moving parts.
It would be interesting to compare both the benchmark results and the way other models approach the whole refactoring process!
Thanks for the feedback, appreciate it!
That makes a lot of sense - I'll update the article with links to the actual prompts.
Initially I thought these would be too lengthy for the article and no one would care, but it seems people are really interested. Of course I'd be happy to share the details.
I see that you've added links to a pull request that show the previous and final optimized prompts. However, the OP was asking for the prompt you gave to Claude to assist you in optimizing your prompt. Would you mind sharing that one? (That way nobody has to reverse engineer the instructions from the diff you provided.)