

Model training observations from both Llama 3 and 4 papers:

Meta’s Llama 3 was trained on ~16K H100s, achieving ~380–430 TFLOPS per GPU in BF16 precision, translating to a solid 38–43% hardware efficiency [Meta, Llama 3].

For Llama 4 training, Meta doubled the compute, using ~32K H100s, and switched to FP8 precision. Despite FP8's higher theoretical throughput, observed efficiency dropped to about 19.7%, with GPUs delivering ~390 TFLOPS out of a theoretical 1,979 FP8 TFLOPS [Meta, Llama 4].
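To make the utilization math concrete, here's the back-of-the-envelope arithmetic as a quick Python sketch (the ~989 TFLOPS dense BF16 peak for an H100 SXM is the published spec; the FP8 peak is the 1,979 TFLOPS figure cited above):

    # Hardware efficiency (MFU) = achieved TFLOPS / theoretical peak TFLOPS
    H100_BF16_PEAK = 989   # dense BF16 TFLOPS, H100 SXM published spec
    H100_FP8_PEAK = 1979   # dense FP8 TFLOPS, as cited above

    print(f"Llama 3: {380 / H100_BF16_PEAK:.1%} to {430 / H100_BF16_PEAK:.1%}")  # ~38.4% to ~43.5%
    print(f"Llama 4: {390 / H100_FP8_PEAK:.1%}")                                 # ~19.7%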

I don't say this to critique; rather, it's a recognition of the enormous complexity of operating GPUs at this scale. Training massive models across tens of thousands of GPUs stretches today's AI infrastructure to its limits.

Beyond accelerating inference workloads, advanced GPU optimizations can also be integrated into training and fine-tuning pipelines. From kernel-level optimization techniques (there are over 90) to more efficient memory access to cluster-wide resource coordination, a lot of efficiency can be recovered with some complex software.

References:

[Meta, Llama 3] https://ai.meta.com/research/publications/the-llama-3-herd-o...
[Meta, Llama 4] https://ai.meta.com/blog/llama-4-multimodal-intelligence/


If true, this is a very nice incremental improvement. It doesn't appear to meaningfully improve the capabilities of the model, but it is cheaper to compute than RMSNorm (which essentially all current state-of-the-art LLMs use), which means faster/cheaper training.
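For reference, RMSNorm itself is already tiny; a minimal NumPy sketch (eps and the learned per-dimension gain g are the usual ingredients):

    import numpy as np

    def rmsnorm(x, g, eps=1e-6):
        # Scale x by the reciprocal of its root mean square, then apply a
        # learned per-dimension gain g. Unlike LayerNorm there is no mean
        # subtraction and no bias, which is what makes it cheap.
        rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
        return (x / rms) * g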



Context is the most challenging bit. FWIW, the codebases I'm working on are still small enough that I rarely need to include more than 12 files in context. And I find that as I grow the context beyond that, results degrade significantly.

So I don't know how this would go in a much larger codebase.

What floored him was simply how much of my programming I was doing with an LLM / how little I write line-by-line (vs edit line-by-line).

If you're really curious, I recorded some work for a friend. The first video has terrible audio, unfortunately. The second one, I think, gives a very realistic demonstration – you'll see the model struggle a bit at the beginning:

https://www.loom.com/share/20d967be827141578c64074735eb84a8


That still irks me. The real problem is not tinygram prevention. It's ACK delays, and that stupid fixed timer. They both went into TCP around the same time, but independently. I did tinygram prevention (the Nagle algorithm) and Berkeley did delayed ACKs, both in the early 1980s. The combination of the two is awful. Unfortunately, by the time I found out about delayed ACKs, I had changed jobs, was out of networking, and doing a product for Autodesk on non-networked PCs.

Delayed ACKs are a win only in certain circumstances - mostly character echo for Telnet. (When Berkeley installed delayed ACKs, they were doing a lot of Telnet from terminal concentrators in student terminal rooms to host VAX machines doing the work. For that particular situation, it made sense.) The delayed ACK timer is scaled to expected human response time. A delayed ACK is a bet that the other end will reply to what you just sent almost immediately. Except for some RPC protocols, this is unlikely. So the ACK delay mechanism loses the bet, over and over, delaying the ACK, waiting for a packet on which the ACK can be piggybacked, not getting it, and then sending the ACK, delayed. There's nothing in TCP to automatically turn this off. However, Linux (and I think Windows) now have a TCP_QUICKACK socket option. Turn that on unless you have a very unusual application.
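On Linux, turning it on is one setsockopt call; a minimal Python sketch (the host and port are placeholders, and note that per tcp(7) the flag is not permanent, so long-lived code re-sets it around reads):

    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(("example.com", 80))   # placeholder peer
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)
    # ... after recv() calls, re-set it if you need quick ACKs throughout:
    # s.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)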

Turning on TCP_NODELAY has similar effects, but can make throughput worse for small writes. If you write a loop which sends just a few bytes (worst case, one byte) to a socket with "write()", and the Nagle algorithm is disabled with TCP_NODELAY, each write becomes one IP packet. This increases traffic by a factor of 40, with IP and TCP headers for each payload. Tinygram prevention won't let you send a second packet if you have one in flight, unless you have enough data to fill the maximum sized packet. It accumulates bytes for one round trip time, then sends everything in the queue. That's almost always what you want. If you have TCP_NODELAY set, you need to be much more aware of buffering and flushing issues.
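To illustrate the buffering point, a small Python sketch (the peer address and message fragments are made up): with TCP_NODELAY set, coalesce in userspace instead of issuing many tiny writes:

    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(("example.com", 9000))   # placeholder peer
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

    pieces = [b"HEADER", b"BODY-1", b"BODY-2"]   # made-up message fragments

    # Bad: with Nagle disabled, each send() can go out as its own IP
    # packet, paying ~40 bytes of TCP/IP headers per tiny payload:
    #   for p in pieces:
    #       s.sendall(p)

    # Better: accumulate in userspace and flush once.
    s.sendall(b"".join(pieces))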

None of this matters for bulk one-way transfers, which is most HTTP today. (I've never looked at the impact of this on the SSL handshake, where it might matter.)

Short version: set TCP_QUICKACK. If you find a case where that makes things worse, let me know.

John Nagle

