I see the value in that, but there are a few reasons it isn't on the immediate roadmap -- mainly, it shifts focus from measuring the model to measuring the harness. The agentic benchmark section you see on the site is comparable to how an agent would perform using an open harness like Pi. But the latest tool-using models are pretty well adapted to any harness, so I think that's less of a factor in overall model performance.
I haven't tried it myself, but I would assume that this sort of instruction in CLAUDE.md would indeed make it a bit more careful, to the detriment of its development velocity, which for my use case would be bad. I generally prefer for it to experiment in many directions rapidly, and to do extensive testing only once we have an approach that solves the problem well.
When I was younger I was sold on the idea of data-driven decisions. Everything needs to be measured, otherwise you are just biased, and bias is bad. Nowadays I do still rely on data and measurements, but I also have experience and taste to judge things. To answer your question: the latter.
The first result is on Windows, which has had more problems with Codex (or at least it did up until a few months ago). The second is someone who asked Codex to delete all files that were unrelated to the project files.
You can get one Good Friday a year if you live in a country that treats it as a bank holiday, or is Catholic enough that it's effectively a day off, even if not an official one.
You can get extra Fridays off if you move to a country with bank holidays that tend to land on Fridays, which is correlated with a history of either communism or organized religion (much like the weekend).
But if you want every Friday off, your best bet is to embrace hyper-capitalism and worm your way into money so you can have a four-day work week.
(Easier to achieve than the legendary four-hour work week anyway.)
TL;DR: the more opposing ideologies you can simultaneously hold, the more days off in a week you're morally entitled to :).
I'd argue that the disaster started mid-4.6, when they started juggling rate limits while hitting uptime problems. It's great that we have some healthy competition, and I'm waiting for the next move from DeepMind.