
desideratum · 2026-04-06T10:09:27 1775470167

Yes my findings and thoughts were pretty much identical. I actually think you can get something reasonable at 1.3B params with the correct training recipe, but definitely not at this compute/token budget.

One thing I found was that the model would pretty much always emit solutions from its training data when asked to solve problems, but it was much better at using Bash commands to explore a codebase, for example.

The Hugging Face folks have a great post on using CAI for vibes/character post-training rather than just harmlessness: https://huggingface.co/blog/constitutional_ai#oh-honey-lets-...

desideratum · 2026-04-05T20:26:37 1775420797

This is a gross simplification of the process - you would typically use order(s) of magnitude more data and compute, and a substantial amount of online reinforcement learning to elicit emergent tool use capabilities.

Many recent OSS models have great tech reports where you can learn more about these kinds of things: Kimi 2.5 https://github.com/MoonshotAI/Kimi-K2.5/blob/master/tech_rep... GLM 5 https://arxiv.org/abs/2602.15763 DeepSeek R1 https://arxiv.org/pdf/2501.12948

desideratum · 2026-04-05T20:21:54 1775420514

I appreciate the kind words very much : )

desideratum · 2026-04-05T20:21:36 1775420496

I see what you mean, but I disagree. I expect that Claude Code is backed by a separate post-train of Claude base which has been trained using the Claude Code harness and toolset.

vova_hn2 · 2026-04-05T20:29:38 1775420978

It is possible of course, but I see no reason to believe it.

jasonjmcghee · 2026-04-05T22:08:56 1775426936

fwiw, other models seem to / are reported to struggle much more with using claude code compared with codex / opencode / pi etc.

that being said, there are other potential explanations

desideratum · 2026-04-05T20:17:03 1775420223

Oh I wouldn't be surprised. This is a sample from one of the OSS code datasets I'd used, which are all generated synthetically using LLMs. Data is indeed the moat.

desideratum · 2026-04-05T16:30:46 1775406646

This is a great question. You definitely aren't training this to use it, you're training it to understand how things work. It's an educational project: if you're interested in experimenting with things like distributed training techniques in JAX, or preference optimisation, this gives you a minimal and hackable library to build on.

wongarsu · 2026-04-05T18:54:24 1775415264

It's also a great base for experimentation. If you have an idea for an architecture improvement you can try it for $36 on the 20 layer nanocode setting, then for another $200 see how it holds up on the "full scale" nanocode

Karpathy's notes on improving nanochat [1] are one of my favorite blog-like things to read. Really neat to see which features have how much influence, and how the scaling laws evolve as you improve the architecture

There's also modded-nanogpt which turns the same kind of experimentation into a training speedrun (and maybe loses some rigor on the way) [2]

1 https://github.com/karpathy/nanochat/blob/master/dev/LOG.md

2 https://github.com/kellerjordan/modded-nanogpt

desideratum · 2026-03-03T14:46:37 1772549197

It's mind-boggling that Apple is considering the base 27 inch Studio Display, with the same 4 year old panel but some new accessories slapped on, an "upgrade".

kllrnohj · 2026-03-03T17:45:54 1772559954

The base 27" wasn't even a new display 4 years ago, it's the same thing they were shipping in iMacs before that. It dates back to like 2017?

raydev · 2026-03-04T04:27:06 1772598426

The 5k iMac was introduced in 2014. There was one change in 2015 that added P3 color gamut, so it appears to have been the exact same LG-manufactured panel for at least 11 years.

desideratum · 2026-03-03T14:55:17 1772549717

Oh, and if you want to utilize 120Hz on the XDR display, you're going to have to replace your perfectly functioning Mac.

> Mac models with M1, M1 Pro, M1 Max, M1 Ultra, M2, and M3 support Studio Display XDR at up to 60Hz. All other Studio Display XDR features are supported.

cosmic_cheese · 2026-03-03T15:05:15 1772550315

Almost certainly due to bandwidth limitations on older versions of Thunderbolt. Full bit depth HDR 5k @ 120hz requires some absurd data throughput.
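
To put a rough number on that throughput, here is a back-of-the-envelope sketch. It assumes 10 bits per channel RGB for "full bit depth HDR" (an assumption, the thread doesn't say), counts only raw pixel data, and ignores blanking, link encoding overhead, and DSC compression. The 40 Gbit/s figure is Thunderbolt 4's nominal total link rate, not what is actually available to display traffic, so treat the comparison as indicative only.

```python
def raw_video_gbps(width, height, refresh_hz, bits_per_channel=10, channels=3):
    """Uncompressed pixel data rate in Gbit/s (no blanking, no link overhead)."""
    bits_per_second = width * height * refresh_hz * bits_per_channel * channels
    return bits_per_second / 1e9


# Nominal Thunderbolt 4 link rate; the usable display bandwidth is lower.
THUNDERBOLT_4_GBPS = 40

# 5120 x 2880 panel, as quoted elsewhere in the thread.
at_60hz = raw_video_gbps(5120, 2880, 60)
at_120hz = raw_video_gbps(5120, 2880, 120)

print(f"5K @ 60 Hz:  {at_60hz:.2f} Gbit/s raw")
print(f"5K @ 120 Hz: {at_120hz:.2f} Gbit/s raw")
print(f"Fits in a nominal 40 Gbit/s link uncompressed at 120 Hz? {at_120hz <= THUNDERBOLT_4_GBPS}")
```

Under these assumptions the 120 Hz mode needs roughly 53 Gbit/s of raw pixel data, so driving it over an older link would require compression such as DSC or a reduced color format.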

realityking · 2026-03-03T15:52:19 1772553139

I don’t think so. My M3 Pro is on the list as supporting 120 hz but it only has Thunderbolt 4.

Also the base M4 doesn’t have Thunderbolt 5 and it supports 120 hz.

strongpigeon · 2026-03-03T16:54:44 1772556884

> My M3 Pro is on the list as supporting 120 hz

Can you point me to said list? All I could find was:

> Mac models with M1, M1 Pro, M1 Max, M1 Ultra, M2, and M3 support Studio Display XDR at up to 60Hz. All other Studio Display XDR features are supported.

And The Verge reports:

> There’s also support for adaptive sync that can adjust between 47Hz and 120Hz (if it’s connected to an M4 Mac or later, or the M5 iPad Pro)

I got an M3 Max and was strongly considering upgrading my old monitor, but if I can't do 120hz, I'll just wait until I upgrade my laptop as well.

klardotsh · 2026-03-03T17:45:49 1772559949

I’ll give you an anecdote: my work laptop is an M3 Pro MBP, and my Dell U4025QW works just fine with it over Thunderbolt at 120Hz VRR

FinnKuhn · 2026-03-03T19:04:54 1772564694

That monitor has a noticeably lower pixel count.

Dell U4025QW: 5120 x 2160 = 11,059,200 vs Apple Studio Display XDR: 5120 x 2880 = 14,745,600

So your display has 25% fewer pixels.

archagon · 2026-03-03T18:41:59 1772563319

It’s quite possible this is running with a reduced color space (chroma subsampling). Degradation happens automatically based on available throughput and most people don’t notice.

jasomill · 2026-03-03T19:28:21 1772566101

For desktop use? Chroma subsampling is obvious. DSC compression, on the other hand, is not. DisplayPort and HDMI support both.

archagon · 2026-03-03T19:32:52 1772566372

It’s obvious if you use a test pattern and/or know what to look for: https://testufo.com/chroma

I had no idea what it was for the longest time. As it turns out, macOS frequently enables it even when it’s unnecessary, and without any way to override.

realityking · 2026-03-03T22:27:29 1772576849

> Can you point me to said list?

There’s no list per se. The MacBook Pro (2021 and later) is listed as supported. The M3 Pro and M3 Max are not listed as only supporting 60Hz while the M3 and M1 Pro are.

radley · 2026-03-03T17:19:41 1772558381

They did say M3, not M3 Pro. You're probably okay.

(Notice how they listed the M1 chips individually.)

kubik369 · 2026-03-03T15:15:54 1772550954

I don't really see your point. The chips mentioned do not have enough bandwidth on display outputs to support the monitor at 6K@120Hz. If anything, I find it surprising that Apple supports running the display in 60Hz mode instead of telling people to go pound sand and buy new Macs.

desideratum · 2025-12-06T16:25:27 1765038327

The Scaling ML textbook also has an excellent section on TPUs. https://jax-ml.github.io/scaling-book/tpus/

jauntywundrkind · 2025-12-06T17:56:13 1765043773

I also enjoyed https://henryhmko.github.io/posts/tpu/tpu.html https://news.ycombinator.com/item?id=44342977 .

The work that XLA & schedulers are doing here is wildly impressive.

This feels drastically harder to work with than Itanium must have been. ~400bit VLIW, across extremely diverse execution units. The workload is different, it's not general purpose, but it's still awe-inspiring to know not just that they built the chip but that the software folks can actually use such a wildly weird beast.

I wish we saw more industry uptake for XLA. Uptake's not bad, per se: there's a bunch of different hardware it can target! But what amazing secret sauce, it's open source, and it doesn't feel like there's the industry rally behind it that it deserves. It feels like Nvidia is only barely beginning to catch up, to dig a new moat, with the just announced Nvidia Tiles. Such huge overlap. Afaik, please correct if wrong, but XLA isn't at present particularly useful at scheduling across machines, is it? https://github.com/openxla/xla

alevskaya · 2025-12-06T18:54:39 1765047279

I do think it's a lot simpler than the problem Itanium was trying to solve. Neural nets are just way more regular in nature, even with block sparsity, compared to generic consumer pointer-hopping code. I wouldn't call it "easy", but we've found that writing performant NN kernels for a VLIW architecture chip is in practice a lot more straightforward than other architectures.

JAX/XLA does offer some really nice tools for doing automated sharding of models across devices, but for really large performance-optimized models we often handle the comms stuff manually, similar in spirit to MPI.

jauntywundrkind · 2025-12-06T21:19:49 1765055989

I agree with regards to the actual work being done by the systolic arrays, which sort of are VLIW-ish & have a predictable plannable workflow for them. Not easy, but there's a very direct path to actually executing these NN kernels. The article does an excellent job setting up how great a win it is that the systolic MXUs can do the work, don't need anything but local registers and local communication across cells, don't need much control.

But if you make it 2900 words through this 9000 word document, to the "Sample VLIW Instructions" and "Simplified TPU Instruction Overlay" diagrams, trying to map the VLIW slots ("They contain slots for 2 scalar, 4 vector, 2 matrix, 1 miscellaneous, and 6 immediate instructions") to useful work one can do seems incredibly challenging, given the vast disparity of functionality and style of the attached units that the instruction governs, and given the extreme complexity in keeping that MXU constantly fed, with very tight timing so that it is constantly well utilized.

> Subsystems operate with different latencies: scalar arithmetic might take single digit cycles, vector arithmetic 10s, and matrix multiplies 100s. DMAs, VMEM loads/stores, FIFO buffer fill/drain, etc. all must be coordinated with precise timing.

Whereas Itanium's compilers needed to pack parallel work into a single instruction, there's maybe less need for that here. But that quote there feels like an incredible heart of the machine challenge, to write instruction bundles that are going to feed a variety of systems all at once, when these systems have such drastically different performance profiles / pipeline depths. Truly an awe-some system, IMO.

Still though, yes: Itanium's software teams did have an incredibly hard challenge finding enough work at compile time to pack into instructions. Maybe it was a harder task. What a marvel modern cores are, having almost a dozen execution units that cpu control can juggle and keep utilized, analyzing incoming instructions on the fly, with deep out-of-order dependency-tracking insight. Trying to figure it all out ahead of time & packing it into the instructions a priori was a wildly hard task.
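
To make the slot-filling problem concrete, here is a toy sketch in Python. It only borrows the per-bundle slot counts from the quoted article text (2 scalar, 4 vector, 2 matrix, 1 miscellaneous; immediates are left out). Everything else is a simplifying assumption and not how XLA or any real TPU compiler schedules: ops are treated as independent, latencies are ignored, and packing is strictly in program order. The names `SLOTS`, `pack_bundles`, and `utilization` are hypothetical.

```python
from collections import Counter

# Slot capacity per bundle, from the quoted article; immediates omitted.
SLOTS = {"scalar": 2, "vector": 4, "matrix": 2, "misc": 1}


def pack_bundles(ops):
    """Greedily pack op kinds into bundles, in order.

    Toy model: ops are assumed independent, so only slot capacity matters.
    A new bundle is started whenever the next op's slot kind is already full.
    """
    bundles = []
    current = Counter()
    for kind in ops:
        if kind not in SLOTS:
            raise ValueError(f"unknown op kind: {kind}")
        if current[kind] == SLOTS[kind]:
            bundles.append(dict(current))
            current = Counter()
        current[kind] += 1
    if current:
        bundles.append(dict(current))
    return bundles


def utilization(bundles):
    """Fraction of available slots that hold an op."""
    if not bundles:
        return 0.0
    used = sum(sum(b.values()) for b in bundles)
    return used / (len(bundles) * sum(SLOTS.values()))


ops = ["matrix"] * 4 + ["vector"] * 6 + ["scalar"] * 3
bundles = pack_bundles(ops)
for i, bundle in enumerate(bundles):
    print(i, bundle)
print(f"slot utilization: {utilization(bundles):.0%}")
```

Even in this simplified setting, naive in-order packing leaves most slots empty. A real scheduler would also have to reorder ops and respect the very different unit latencies the quote describes.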

desideratum · 2025-12-06T18:42:29 1765046549

Thanks for sharing this. I agree w.r.t. XLA. I've been moving to JAX after many years of using torch and XLA is kind of magic. I think torch.compile has quite a lot of catching up to do.

> XLA isn't at present particularly useful at scheduling across machines,

I'm not sure if you mean compiler-based distributed optimizations, but JAX does this with XLA: https://docs.jax.dev/en/latest/notebooks/Distributed_arrays_...

cpgxiii · 2025-12-06T19:20:55 1765048855

In Itanium's heyday, the compilers and libraries were pretty good at handling HPC workloads, which is really the closest anyone was running then to modern NN training/inference. The problem with Itanium and its compilers was that people obviously wanted to run workloads that looked nothing like HPC (databases, web servers, etc) and the architecture and compilers weren't very good at that. There have always been very successful VLIW-style architectures in more specialized domains (graphics, HPC, DSP, now NPU); it just hasn't worked out well for general-purpose processors.

jauntywundrkind · 2025-12-07T00:04:23 1765065863

Side note, just ran into this article that mentions how Amazon is planning to have XLA / JAX support in the future for their Trainium chips. https://newsletter.semianalysis.com/p/aws-trainium3-deep-div...

desideratum · 2025-11-29T15:36:11 1764430571

Aside: this guy regularly posts on the Discord server for an open-source post-training framework I maintain, demanding repayment for bugs in nightly builds and generally abusing the maintainers.

immibis · 2025-11-30T10:07:45 1764497265

I assume you offered him the choice of buying a support contract or getting banned. Otherwise why is he still allowed to do that?

desideratum · on Jan 25, 2025

This is an exceptional salary for the UK.

pbalau · on Jan 25, 2025

Plus, jobs like this are for people that have a connection with the organization. I can't see myself doing anything, however good the pay might be, for Arsenal. Would be very glad to do that for Chelsea.

lifeisstillgood · on Jan 25, 2025

Oh that’s a fascinating reaction I never thought of. I mean people might refuse to work for any gambling company or any arms maker, but I cannot imagine someone offered a job in banking refusing, say, Goldman but taking JP Morgan, simply because their family have been Morgan fans for generations and would never accept any other bank …

pbalau · on Jan 25, 2025

Not sure if you're intentionally obtuse or simply don't get it. There is no reason to be a fan of an organization whose main goal is to be making money, but there are plenty of reasons to be a fan of an organization whose main goal is to be better at a sport than other similar organizations.

atq2119 · on Jan 25, 2025

I don't think the distinction is about whether the organizations make money; the distinction is about entertainment. Professional sports are primarily a form of entertainment. Sports fans are a bit like music fans in that regard. The rivalries in sports can get toxic sometimes, but there's weird snobbery in music and other arts as well.

addandsubtract · on Jan 25, 2025

There's no reason to be a fan of an organization whose main goal is to make money, yet GP said he's a Chelsea fan.

idiotsecant · on Jan 25, 2025

You don't think professional soccer teams are an organization whose main goal is to make money?

dambi0 · on Jan 25, 2025

If it is, and it probably isn't, they are really not very good at it.

alt227 · on Jan 25, 2025

So they are looking for people in the Venn diagram cross section of AI researchers and Arsenal FC supporters, of which I imagine there are very few.

druskacik · on Jan 25, 2025

You underestimate how much being a soccer fan means, especially in countries like the UK.

dh2022 · on Jan 25, 2025

I met some Americans who would never work for the New England Patriots.

shrikant · on Jan 25, 2025

Having done a bit of consulting for Chelsea FC, I wouldn't recommend working there in an office job. Poor pay (I doubt anyone below "Head of"/CxO even touches 6 figures) and very average working conditions.

ksynwa · on Jan 25, 2025

I thought at first that you might be bantering but after reading your last sentence I am not sure anymore. I don't think being associated with Arsenal at this posting is ever going to be a blemish on anyone's CV. They are higher than Chelsea in PL standings at this moment.

pbalau · on Jan 25, 2025

Why bantering?

150k a year puts you in the 1% earners in the UK, plenty to comfortably live on.

As you mentioned, having such a job in one's CV can only help.

But, I've been a Chelsea fan for most of my life, if I take such a job at Arsenal and do it properly, I'll be actively working against the organization that brought me so many emotions over the past 30ish years.

haliskerbas · on Jan 25, 2025

Aren’t they both owned by pompous rich dudes and private equity, like most companies? Tasty is the boot we enjoy licking.

Zenul_Abidin · on Jan 25, 2025

Arsenal's owned by Stan Kroenke. Chelsea by Todd Boehly the LA Dodgers guy.

vasco · on Jan 25, 2025

Everything is owned by rich people, is your solution to not enjoy anything?

andrepd · on Jan 25, 2025

Actually in countries like Portugal or Germany, most clubs are owned by their members (or at the very least, 51% owned by their members), who can e.g. vote on the club's president.

superfrank · on Jan 25, 2025

You're not totally wrong about the 51% thing (although it's really 50% + 1 vote), but it's not like that is a panacea that keeps out corporate interests. Leverkusen and Wolfsburg are owned by Bayer and Volkswagen respectively (although I understand there are historical reasons for that) and Leipzig is 99% owned by Red Bull, but they only have 50% - 1 voting rights to comply.

lifeisstillgood · on Jan 25, 2025

Yes I think that’s just in the top 1% of salaries

But I’m pretty sure Brad got Jonah Hill for much less.

BowBun · on Jan 25, 2025

Surely you mean Scorsese got Jonah??

kgen · on Jan 25, 2025

I think he means in Moneyball
