I wrote this after noticing a pattern in how alternative AI architectures evolve: Mamba rewrote its core operations as GEMMs; RetNet was abandoned by its own authors at Microsoft Research; RWKV hit a ceiling at 14B parameters despite years of community effort.
The thesis is that Transformers and NVIDIA GPUs co-evolved into a stable attractor basin. Any architecture that wants to compete at frontier scale must pass two reinforcing gates: hardware compatibility (can you saturate Tensor Cores?) and institutional backing (will a major lab commit?). The gates reinforce each other because poor hardware compatibility makes institutional bets risky, and lack of institutional backing means no one invests in the kernel optimizations that would improve hardware compatibility.
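For concreteness, here is what "can you saturate Tensor Cores?" looks like as a quick measurement (my sketch, not code from the essay): time a large bf16 GEMM in PyTorch and compare achieved TFLOP/s against an assumed peak. The 312 TFLOP/s figure is the A100 bf16 Tensor Core spec; swap in your own GPU's number.

```python
import time
import torch

# Rough saturation check (assumes a CUDA GPU; peak is the A100 bf16 spec).
PEAK_TFLOPS = 312.0
N = 8192
a = torch.randn(N, N, device="cuda", dtype=torch.bfloat16)
b = torch.randn(N, N, device="cuda", dtype=torch.bfloat16)

for _ in range(3):  # warm up so kernel selection doesn't skew the timing
    a @ b
torch.cuda.synchronize()

iters = 20
start = time.perf_counter()
for _ in range(iters):
    a @ b
torch.cuda.synchronize()
elapsed = (time.perf_counter() - start) / iters

achieved = (2 * N**3) / elapsed / 1e12  # 2*N^3 FLOPs per GEMM
print(f"{achieved:.0f} TFLOP/s achieved, {100 * achieved / PEAK_TFLOPS:.0f}% of assumed peak")
```

A large dense GEMM like this usually lands close to peak; the gate is about architectures whose inner loops cannot be expressed this way and therefore never get close.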
The essay includes specific numbers on Tensor Core utilization for different architectures, an analysis of why alternative chip vendors face structural barriers that Google overcame in 2016 but that are now nearly impossible to replicate, and three falsifiable predictions for 2027-2028.
I tried to be precise about what we can and cannot conclude from the available evidence. RetNet, for example, exposes a diagnostic blind spot in the framework: we genuinely cannot tell whether it failed because of hidden hardware friction at scale, quality degradation beyond 6.7B parameters, or pure institutional risk aversion. Microsoft never published the scaling experiments that would distinguish these hypotheses.
Would be interested to hear from anyone with inside knowledge of scaling experiments that didn't get published, or from people working on alternative architectures who have data that contradicts this framing.
This is a deep dive into ToolOrchestra (Su et al., 2025), an 8B-parameter model trained with GRPO to route queries across specialized workers (code interpreters, search, frontier models).
The research is elegant. The deployment is brutal. This post covers the production failure modes I found while reverse-engineering the system:
- Latency tail (P99 degrades exponentially with chain length; toy simulation below)
- "Denial of Wallet" attacks (50,000x cost amplification)
- Breakeven analysis (orchestration only pays off above ~75K queries/month; back-of-envelope version below)
- The Four Gates decision tree for adoption
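A toy illustration of the latency-tail point (assumed numbers, not the post's measurements): model each hop in the orchestration chain as a ~200 ms call that occasionally hits a 5 s stall (timeout plus retry), then look at how the chain's P50 and P99 move as the chain gets longer.

```python
import numpy as np

# Toy model (all numbers are assumptions for illustration): each hop takes
# ~200 ms, but with 0.5% probability it hits a 5 s stall. Chain latency is
# the sum over hops.
rng = np.random.default_rng(0)
n_samples = 200_000
base_ms, stall_ms, stall_prob = 200.0, 5_000.0, 0.005

for hops in (1, 2, 4, 8):
    base = rng.normal(base_ms, 30.0, size=(n_samples, hops)).clip(min=0)
    stalls = rng.random((n_samples, hops)) < stall_prob
    chain = (base + stalls * stall_ms).sum(axis=1)
    p50, p99 = np.percentile(chain, [50, 99])
    print(f"hops={hops}: P50={p50:6.0f} ms   P99={p99:6.0f} ms")
```

The median grows linearly with hops, but once the chain is long enough that some hop almost always misbehaves, the rare stall lands inside the 99th percentile and P99 jumps by an order of magnitude.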
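And a back-of-envelope version of the breakeven claim, with placeholder prices (mine, not figures from the post): orchestration trades a fixed monthly cost for hosting the 8B router against a lower blended per-query cost, so it only wins past some volume.

```python
# All prices are placeholders for illustration, not figures from the post.
frontier_cost_per_query = 0.02        # $/query if every query goes to a frontier model
routed_cost_per_query = 0.005         # $/query blended cost once most queries stay local
orchestrator_fixed_monthly = 1_200.0  # $/month for GPU hosting + ops of the 8B router

savings_per_query = frontier_cost_per_query - routed_cost_per_query
breakeven = orchestrator_fixed_monthly / savings_per_query
print(f"breakeven ≈ {breakeven:,.0f} queries/month")  # ~80,000 with these numbers
```

The exact threshold moves with your prices, but the shape of the argument (fixed overhead amortized against per-query savings) is what puts the figure in the tens of thousands of queries per month.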
Part 4 of a 4-part series. Earlier issues cover the RL mechanics (Issue 1), synthetic data (Issue 2), and emergent behaviors (Issue 3).
Author here.
I’ve been trying to answer a specific question: Why do "technically superior" architectures (like Neural ODEs, KANs, or pure SSMs) consistently fail to displace the Transformer?
My thesis is that we are looking at the wrong metric. We usually look at "FLOPs per token" or convergence rates. But in reality, hardware imposes a "compute tax" based on how much an idea deviates from optimized GPU primitives like dense matrix multiplications (GEMMs).
I call this the Hardware Friction Map, and I’ve categorized architectures into four zones based on the engineering cost to clear "Gate 1" (viability):
1. Green Zone (Low Friction): Things like RoPE or GQA. They ship in months because they map to existing kernels.
2. Yellow Zone (Kernel Friction): FlashAttention is the standard here. Even though the math worked in 2022, it took 20+ months to become universal because of the "ecosystem tax" (integration into PyTorch, vLLM, etc.).
3. Orange Zone (System Friction): This is where MoEs sit. Everyone talks about DeepSeek V3, but we forget they had to rewrite their cluster scheduler and spend 6 months on infra to make it work. That high friction is a moat for them, but often a death sentence for startups that don't have the runway to debug distributed routing logic.
4. Red Zone (Prohibitive Friction): Architectures like KANs. They rely on tiny, irregular spline evaluations that drop Tensor Core utilization to ~10%. They are theoretically elegant but economically unshippable (rough illustration after this list).
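As a rough illustration of the Red Zone point (my sketch, not the post's benchmark): issue the same number of FLOPs as one large GEMM and as a batch of tiny matmuls, which is closer to the shape small per-edge spline work produces. The tiny-shape version runs far slower at equal FLOPs because it cannot keep the Tensor Cores fed.

```python
import time
import torch

def time_op(fn, iters=10):
    # Time a CUDA op with warmup and synchronization.
    for _ in range(3):
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# Equal total FLOPs, two shapes: one dense 2048^3 GEMM vs. (2048/32)^3
# independent 32x32 matmuls (a crude stand-in for small, irregular spline
# evaluations; real KAN workloads are even less regular than this).
big_a = torch.randn(2048, 2048, device="cuda", dtype=torch.bfloat16)
big_b = torch.randn(2048, 2048, device="cuda", dtype=torch.bfloat16)
n_tiny = (2048 // 32) ** 3
tiny_a = torch.randn(n_tiny, 32, 32, device="cuda", dtype=torch.bfloat16)
tiny_b = torch.randn(n_tiny, 32, 32, device="cuda", dtype=torch.bfloat16)

t_big = time_op(lambda: big_a @ big_b)
t_tiny = time_op(lambda: torch.bmm(tiny_a, tiny_b))
print(f"dense GEMM: {t_big * 1e3:.2f} ms | tiny batched matmuls: {t_tiny * 1e3:.2f} ms "
      f"({t_tiny / t_big:.1f}x slower at equal FLOPs)")
```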
I also did a deep dive into the "Context Trap" for MoEs (throughput dropping ~60% at 32k context due to routing overhead) and why pure SSMs seem to hit a "scalability cliff" at 13B parameters, forcing hybrids like Jamba.
I’ve open-sourced a dataset scoring 100+ architectures on this friction scale (linked in the post). Curious to hear if others are seeing this "friction" kill internal projects.
[Discussion] How the deep learning field evolved from designing specific models to designing languages of reusable components.
This post argues that the deep learning field has evolved into something that now resembles a new "language" for DL. I try to ground this idea in the key papers that trace that evolution and show how it ties to the concept of a new "grammar".
To make it digestible, the linked Substack post includes a video overview, a podcast-style deep dive, and an extensive written piece covering the papers from the last 13 years that led to the conclusion in the title.
Feel free to discuss the idea if you like it; I'd be glad to answer questions.