Most "sparse training" in PyTorch today isn't actually sparse. A binary mask gets multiplied into a dense weight matrix, which means the zeros still consume memory, still move through the cache, and still get multiplied. That's pruning simulation, not sparse computation. SparseLab does the other thing: real sparse storage (a custom Padded-CSR layout), custom NEON kernels, in an nn.Linear-compatible layer.
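The masked-dense point is easy to demonstrate. A minimal sketch in plain NumPy (not SparseLab code): multiplying in a 90% binary mask zeroes the values but leaves the tensor's footprint untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024)).astype(np.float32)  # dense weight matrix
mask = rng.random(w.shape) < 0.1                          # keep ~10% of entries

# "Sparse training" by simulation: the zeros are still stored,
# still cached, still multiplied.
masked = w * mask
print(masked.nbytes == w.nbytes)  # True: 4 MiB either way, nothing was saved
```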
The premise. It's been known since the Lottery Ticket Hypothesis (https://arxiv.org/abs/1803.03635) and RigL (https://arxiv.org/abs/1911.11134) that most models train competitively with ~10% of their parameters if those parameters are the right ones, chosen dynamically during training. Every year since, researchers have reproduced this in masked-dense simulation, then hit a wall when they want the actual memory savings. PyTorch's torch.sparse_csr isn't designed for training — the backward pass is unimplemented for most ops, and the ones that exist force dense intermediates, which defeats the point. The alternative has been to write your own CSR + SIMD kernels, a six-month detour from whatever you were actually trying to study. SparseLab is that detour, packaged.
Reproduced numbers (M3 Pro, all in the repo's docs/demos/):
- MLP on MNIST at 90% sparsity (10% of params live): 97.45% vs 98.06% dense — 0.61pp gap, 82% memory reduction. Sparse needed 1.8x more epochs to converge.
- 10M-param transformer on Tiny Shakespeare, 70% sparse attention + 90% sparse FFN, 10k steps: inference memory 15.3 MB vs 41.0 MB (37% of dense), 0.055 nats validation loss gap.
- Scaling check at 40M params, 1000 steps (same architecture family, 4x larger): inference memory 55.8 MB vs 150.7 MB dense — exactly 37% of dense again. The ratio held across the scale-up. Per-step slowdown narrowed from 4.6x to 4.1x as kernel time started dominating Python overhead.
- The honest caveat: on CPU we are still 4.1-4.6x slower per step than dense torch.matmul. The dW (weight-gradient) kernel accounts for most of each step and is unvectorized in v0.1. Memory is the win, not speed.
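For intuition on where the memory numbers come from, here is back-of-envelope CSR accounting. This uses textbook CSR with int32 indices, not SparseLab's exact Padded-CSR layout, so the constants are approximate:

```python
def csr_bytes(rows: int, cols: int, density: float,
              val_bytes: int = 4, idx_bytes: int = 4) -> int:
    """Approximate CSR storage: values + column indices + row pointers."""
    nnz = int(rows * cols * density)
    return nnz * val_bytes + nnz * idx_bytes + (rows + 1) * idx_bytes

rows, cols = 1536, 384
dense = rows * cols * 4                       # fp32 dense weight matrix
sparse = csr_bytes(rows, cols, density=0.10)  # 90% sparsity -> 10% live
print(f"sparse/dense = {sparse / dense:.2f}")  # ~0.20: values and indices each cost ~10%
```

At 90% sparsity the indices roughly double the cost of the surviving values, so storage lands near 20% of dense, consistent with the ~80% reductions reported above once the model's dense layers are included.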
Why CPU-first is the angle. A DGX H100 has 640 GB of GPU HBM across 8 cards and costs $200-400K up front. Ten Hetzner AX102 nodes at ~€104/month each give you 1.28 TB of DDR5 — 2x the trainable memory at a fraction of the capital cost, paid monthly. For independent researchers training in the 100M-1B param range, RAM is the binding constraint, not FLOPs. Real sparse storage turns "doesn't fit in HBM" into "fits in DDR5, trains slow, but trains." DDP for wallclock recovery is on the v0.2 roadmap.
API. Install with "pip install sparselab" (wheels for macOS arm64, Linux x86_64, Linux aarch64). One-line swap from nn.Linear:
import sparselab
layer = sparselab.SparseLinear(1536, 384, sparsity=0.9)
algo = sparselab.RigL(sparsity=0.9, drop_fraction=0.3, update_freq=100)
layer.apply(algo) # mutates topology during training
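Under the hood, a sparse linear forward is SpMM over the compressed layout. Here is a toy version of the idea (plain-CSR matvec in NumPy, for illustration only; SparseLab's actual kernels work over Padded-CSR in hand-written NEON, not this):

```python
import numpy as np

def dense_to_csr(w):
    """Convert a dense matrix to (values, col_idx, row_ptr) CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in w:
        nz = np.nonzero(row)[0]
        values.extend(row[nz])
        col_idx.extend(nz)
        row_ptr.append(len(values))
    return (np.array(values, dtype=np.float32),
            np.array(col_idx, dtype=np.int32),
            np.array(row_ptr, dtype=np.int32))

def csr_matvec(values, col_idx, row_ptr, x):
    """y = W @ x touching only stored entries; zeros never enter the loop."""
    y = np.zeros(len(row_ptr) - 1, dtype=np.float32)
    for i in range(len(y)):
        s, e = row_ptr[i], row_ptr[i + 1]
        y[i] = values[s:e] @ x[col_idx[s:e]]
    return y

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16)).astype(np.float32)
w[rng.random(w.shape) < 0.9] = 0.0  # 90% sparsity
x = rng.standard_normal(16).astype(np.float32)
print(np.allclose(csr_matvec(*dense_to_csr(w), x), w @ x, atol=1e-5))  # True
```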
Help wanted. The aim is for SparseLab to become solid scaffolding for sparse-from-scratch work. Four places a contributor can own something real:
1. A new DST algorithm as a PR — Sparse Momentum, Top-KAST, GraNet. SparsityAlgorithm is ~100 lines; a new algorithm is another ~100.
2. CPU perf — dW kernel NEON/AVX-512 vectorization + parallel scheduling is the highest-leverage contribution. The 40M scaling numbers quantify exactly why.
3. CUDA port of the SpMM and rewrite kernels. v0.1 is CPU-only, but the layout is GPU-friendly.
4. Push the scaling further. We validated the memory ratio at 40M. The 100M+ regime is open territory — if you have CPU cluster time, a GPT-2 small scale-up with a real convergence budget would be the first independent reproduction above author hardware.
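For anyone eyeing contribution #2: assuming the standard linear-layer gradient, dW = dY^T X only needs evaluating at live coordinates (a sampled dense-dense product). A reference version in plain NumPy, illustrative only; the real kernel works over Padded-CSR and is the piece that needs the NEON/AVX-512 work:

```python
import numpy as np

def sampled_dW(dY, X, rows, cols):
    """Gradient w.r.t. sparse weights: one dot product per live (row, col) entry.

    Equals (dY.T @ X)[rows, cols] without materializing the dense product.
    """
    out = np.empty(len(rows), dtype=np.float32)
    for k in range(len(rows)):
        out[k] = dY[:, rows[k]] @ X[:, cols[k]]  # contraction over the batch dim
    return out

rng = np.random.default_rng(0)
B, in_f, out_f = 4, 16, 8
X = rng.standard_normal((B, in_f)).astype(np.float32)     # layer input
dY = rng.standard_normal((B, out_f)).astype(np.float32)   # upstream gradient
rows, cols = np.nonzero(rng.random((out_f, in_f)) < 0.1)  # live weight coordinates

dW_sparse = sampled_dW(dY, X, rows, cols)
print(np.allclose(dW_sparse, (dY.T @ X)[rows, cols], atol=1e-5))  # True
```

The inner loop's gather-then-dot shape is exactly what makes naive vectorization awkward and why this kernel is the current bottleneck.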
SparsityAlgorithm is modeled on Cerebras's SparsityAlgorithm API (https://training-api.cerebras.ai/en/latest/wsc/tutorials/spa...) and credited in the docstrings. v0.1 ships Static, SET, and RigL.