The trick is, with the setup I mentioned, you change the rewards.
The concept is:
Red Team (Test Writers): write tests without seeing the implementation. They define what the code should do based on specs/requirements only. Rewarded by test failures. A new test that passes immediately is suspicious, as it means either the implementation already covers it (diminishing returns) or the test is tautological. Red's ideal outcome is a well-named test that fails, because that represents a gap between spec and implementation that didn't previously have a tripwire. Their proxy metric is "number of meaningful new failures introduced," and the barrier prevents them from writing tests pre-adapted to pass.
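Red's reward rule can be sketched in a few lines. This is a hypothetical scoring helper, not part of any framework; it assumes the first run of a new test file yields (test_name, passed) pairs:

```python
# Sketch of Red's reward: count meaningful new failures, and flag any
# brand-new test that passes on its very first run as suspicious
# (either the behavior already exists, or the test is tautological).

def red_score(first_run_results):
    """Score the first run of a batch of newly written tests."""
    failures = [name for name, passed in first_run_results if not passed]
    suspicious = [name for name, passed in first_run_results if passed]
    return {"new_failures": len(failures), "suspicious_passes": suspicious}

results = [("test_rejects_empty", False), ("test_returns_list", True)]
print(red_score(results))
```

The suspicious list is what a coordinator would probe rather than reward.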
Green Team (Implementers): write the implementation to pass tests without seeing the test code directly. They only see test results (pass/fail) and the spec. Rewarded by turning red tests green. Straightforward, but the barrier makes the reward structure honest. Without it, Green could satisfy the reward trivially by reading assertions and hard-coding. With it, Green has to actually close the gap between spec intent and code behavior, using error messages as a noisy gradient signal rather than exact targets. Their reward is "tests that were failing now pass," and the only reliable strategy to get there is faithful implementation.
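The information barrier itself is just a filter on what Green is shown. A minimal sketch, assuming raw results arrive as (name, passed, message) tuples; the key detail is that assertion text is only visible as a failure message, never as readable test source:

```python
# Sketch of the Red/Green barrier: Green sees test names, pass/fail,
# and failure messages (the noisy signal), but never the assertion
# source of passing tests, so there is nothing to hard-code against.

def green_view(raw_results):
    """Filter raw test results down to what the Green agent may see."""
    return [
        {"test": name, "passed": passed, "message": message if not passed else ""}
        for name, passed, message in raw_results
    ]

raw = [
    ("test_rejects_empty_input", False, "ValueError not raised"),
    ("test_sums_positive_ints", True, "assert add(2, 3) == 5"),
]
print(green_view(raw))
```

Note the passing test's message is blanked: that is the hole through which Green would otherwise read expected values.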
Refactor Team: improve code quality without changing behavior. They can see the implementation but are constrained by the tests passing. Rewarded by nothing changing (pretty unusual in this regard): all tests stay green while code quality metrics improve. They're optimizing a secondary objective (readability, simplicity, modularity, etc.) under a hard constraint (behavioral equivalence). The spec barrier ensures they can't redefine "improvement" to include feature work. If you have any code quality tools, it makes sense to give this team the skills needed to use them.
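The Refactor team's hard constraint can be made concrete by snapshotting behavior before and after a change. This is a sketch under the assumption that behavior is observable as outputs (or exception types) over a fixed input set; the `mean_v1`/`mean_v2` pair is an invented example:

```python
# Sketch of the behavioral-equivalence constraint: record a function's
# output (or exception type) over fixed inputs, and require the
# refactored version to produce an identical snapshot.

def behavior_snapshot(fn, inputs):
    """Record fn's result (or raised exception type) for each input tuple."""
    snapshot = []
    for args in inputs:
        try:
            snapshot.append(("ok", fn(*args)))
        except Exception as exc:
            snapshot.append(("err", type(exc).__name__))
    return snapshot

def mean_v1(xs):  # original: manual accumulation
    total = 0
    for x in xs:
        total = total + x
    return total / len(xs)

def mean_v2(xs):  # refactor: simpler, same behavior (incl. error on [])
    return sum(xs) / len(xs)

cases = [([1, 2, 3],), ([10],), ([],)]
assert behavior_snapshot(mean_v1, cases) == behavior_snapshot(mean_v2, cases)
print("refactor preserves behavior")
```

In practice the test suite itself plays the role of the snapshot; this just shows why exception behavior has to be part of "equivalence" too.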
It's worth being honest about the limits. The spec itself is a shared artifact visible to both Red and Green, so if the spec is vague, both agents might converge on the same wrong interpretation, and the tests will pass for the wrong reason. The Coordinator (your main claude/codex/whatever instance) mitigates this by watching for suspiciously easy green passes (just tell it) and probing the spec for ambiguity, but it's not a complete defense.
Social media is 90% outrage farming by nation states and individuals with agendas. I think the experiment failed. Humans in large groups are so trivially manipulated by algorithms.
Our ape brains at large just can't deal with a firehose of manipulation. We're just giving bad actors a key to our subconscious to destroy the fabric of our civilization, and those bad actors are using it as much as they can.
This article is really missing a discussion of the fact that social media is far, far more inauthentic than real human interaction these days.
Counter-example: Reed Hastings, co-founder and CEO of Netflix for 22 years, famously did the opposite of what pg is saying. Reed insisted on a particular style of employee freedom & responsibility that IMO set the benchmark for innovating year after year and avoiding micro-managers, even as the company scaled past 2000 engineers. This story still has not been fully told. Reed was closely involved, but perhaps the opposite of Steve Jobs.
Sounds like chesky and pg want to turn the tide on that dominant culture in software companies. And I couldn't agree more! A big problem IMO is that most "professional software managers" are taught a management style that focuses on risk. Risk-aversion permeates every decision from compensation to project priorities. It's so pervasive it's like the air they breathe; they don't even realize they're doing it. This is how things run in 99% of companies.
So, my fellow hackers. There is a better way. It's neither the Steve Jobs model nor the John Sculley model. Looks like pg has not yet found it. I hope he does, though. It would be great for YC to encourage experimentation here.
Manager of a hybrid team here: 2 fully remote, 4 split across 2 different remote offices, and 4 in the office with me.
My experience is that remote workers often have higher velocity but lower agility. When there's a well-defined task and little ambiguity, remote workers can usually complete it faster than in-office workers. But when the task is highly ambiguous, requires many course corrections, involves rapid communication, or relies on a large degree of trust, the in-office teams end up more productive. It all stems from known research on the benefits & detriments of office work, i.e. offices build trust and allow higher-bandwidth communication, but they also have more distractions and a less comfortable work environment.
I think the trigger that would bring people back to the office is a new economic boom based on new and unknown technology. That creates a highly ambiguous environment where you're forging ahead in unknown directions and need a lot of trust in leadership to make progress at all. Established companies with known markets probably would be better off adopting remote work - the employees work faster, and there are already well-known processes and strategic direction. Of course, if you're already well-established in your market, it probably doesn't matter what you do.
As someone who lives under a blasphemy culture, I think Americans greatly take for granted their freedom of speech, and it boggles me that this is not a cornerstone of the American Left.
While I could link to hundreds of examples, this one, which occurred at a university, shows offense culture taken to an extreme and weaponized:
Just the implication of possible offense caused this incident. Sure, the main culprits had other ideas, but the mob was there purely because of supposed offense. If you dare, you are welcome to look at [1], the videos of the incident; it's horrifying.
And as a proxy, you can also guess just how guarded and buffered speech is here, and yet such events still happen.
This is a slippery slope, and just because this hasn't happened in the West for a few decades doesn't mean it can't happen at all. And it's not just the left; January 6 has shown that the American right is a threat vector too.
Needless to remind you, it contains some NSFL material; I only linked it because I think some education in reality is in order. This is the extreme end of intolerance to offense.