Hacker News | simonw's comments

Your comment here appears to be a perfect illustration of what Nilay calls "software brain" in the article.

(I have a strong case of software brain as he describes it myself.)


Hacker News isn't a great place to discuss papers generally.

Having a productive discussion around a paper requires at least reading and understanding the abstract, and the most successful content on HN (sadly) is content where people can jump in with an opinion purely from reading the headline.

Anyone know of any forums that are good for discussing papers?


This is true across all research subject areas (I'm not especially tuned into LLM research, but I am into cryptography, which also happens to be a field that gets a lot of play on HN). I think it's just a function of how many people conversant in the field are available to talk about it at any one time.

/r/MachineLearning is not bad

But the gold standard is a small Signal or Discord community of like-minded, fairly tight-knit friends. You may have to organize this yourself.


There are/were isolated communities on Discord around fast.ai, MLC, and MLOps that discuss papers in more depth, but it's hard to sustain a community without a commercial or academic incentive.

The difficulty is perhaps unsurprising given the time sink that is reading a given paper to any reasonably complete degree of understanding.

I just email the authors with questions. Surprisingly high response rate.

Unironically, very niche subreddits.

... and this thread over here seems to be proving me wrong already: https://news.ycombinator.com/item?id=47893779

I wonder if the fact that GPT-5.5 was already available in their Codex-specific API which they had explicitly told people they were allowed to use for other purposes - https://simonwillison.net/2026/Apr/23/gpt-5-5/#the-openclaw-... - accelerated this release!

I've been calling that the "streaming experts" trick: the key idea is to take advantage of Mixture of Experts models, where only a subset of the weights is used for each round of calculations, and load just those weights from SSD into RAM for each round.

As I understand it, if DeepSeek v4 Pro is 1.6T total parameters with 49B active, that means you'd need just 49B in memory, so ~100GB at 16-bit or ~50GB at 8-bit quantized.

v4 Flash is 284B, 13B active so might even fit in <32GB.
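The arithmetic behind those numbers is just parameters × bytes per parameter; a quick sketch (model sizes taken from this thread — real quantized checkpoints carry some extra overhead for embeddings and norms):

```python
def active_weight_gb(active_params_b: float, bits_per_param: int) -> float:
    """Approximate memory needed for just the active parameters of an MoE model."""
    return active_params_b * 1e9 * bits_per_param / 8 / 1e9

# DeepSeek v4 Pro: 49B active parameters
print(active_weight_gb(49, 16))  # ~98 GB at 16-bit
print(active_weight_gb(49, 8))   # ~49 GB at 8-bit

# v4 Flash: 13B active parameters
print(active_weight_gb(13, 8))   # ~13 GB, comfortably under 32 GB
```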


The "active" count is not very meaningful except as a broad measure of sparsity, since the experts in MoE models are chosen per layer. Once you're streaming experts from disk, there's nothing that inherently requires having 49B parameters in memory at once. Of course, the less caching you do in memory, the higher the performance overhead of fetching from disk.

> ~100GB at 16 bit or ~50GB at 8bit quantized.

V4 is natively mixed FP4 and FP8, so significantly less than that. 50 GB max unquantized.
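A minimal sketch of what per-layer expert streaming with an in-memory cache could look like (the loader, capacity, and key scheme here are hypothetical illustrations, not taken from any of the projects mentioned in this thread):

```python
from collections import OrderedDict

class ExpertCache:
    """LRU cache of per-layer expert weights; misses fall through to disk."""
    def __init__(self, capacity: int, loader):
        self.capacity = capacity
        self.loader = loader          # e.g. reads one expert's weights from SSD
        self.cache = OrderedDict()

    def get(self, layer: int, expert: int):
        key = (layer, expert)
        if key in self.cache:
            self.cache.move_to_end(key)       # cache hit: mark recently used
            return self.cache[key]
        weights = self.loader(layer, expert)  # slow path: fetch from SSD
        self.cache[key] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least recently used
        return weights

# The router picks a few experts per layer per token, so only those experts'
# weights need to be resident: RAM usage tracks the cache capacity rather
# than the headline "active" parameter count for the whole forward pass.
```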


Ahh, that actually makes more sense now. (As you can tell, I just skimmed through the READMEs and starred "for later".)

My Mac can fit almost 70B (Q3_K_M) in memory at once, so I really need to try this out soon at maybe Q5-ish.


Streaming weights from RAM to GPU for prefill makes sense due to batching, and PCIe 5.0 x16 is fast enough to make it worthwhile.

Streaming weights from RAM to GPU for decode makes no sense at all because batching requires multiple parallel streams.

Streaming weights from SSD _never_ makes sense because the delta between SSD and RAM is too large. There is no situation where you would not be able to fit a model in RAM and also have useful speeds from SSD.
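A rough back-of-envelope for that delta, assuming the 49B-active figure from upthread and the ~8 GB/s SSD / ~450 GB/s unified-memory bandwidths mentioned elsewhere in this thread (single stream, no caching, so a worst case):

```python
def decode_tokens_per_sec(active_params_b: float, bits: int, bandwidth_gb_s: float) -> float:
    """Decode is bandwidth-bound: every token touches all active weights once."""
    bytes_per_token = active_params_b * 1e9 * bits / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

print(decode_tokens_per_sec(49, 8, 8))    # SSD at 8 GB/s: ~0.16 tokens/s
print(decode_tokens_per_sec(49, 8, 450))  # unified RAM at 450 GB/s: ~9.2 tokens/s
```

Caching and lower-bit quantization narrow that gap, which is presumably how the streaming-experts projects in this thread reach a few tokens per second.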


There have been some very interesting experiments with streaming from SSD recently: https://simonwillison.net/2026/Mar/18/llm-in-a-flash/

I don't mean to be a jerk, but 2-bit quant, reducing experts from 10 to 4, who knows if the test is running long enough for the SSD to thermal throttle, and still only getting 5.5 tokens/s does not sound useful to me.

It's a lot more useful than being entirely unable to try out the model.

But you aren't trying out the model. You quantized beyond what people generally say is acceptable, and reduced the number of experts, which these models are not designed for.

Even worse, the github repo advertises:

> Pure C/Metal inference engine that runs Qwen3.5-397B-A17B (a 397 billion parameter Mixture-of-Experts model) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second with production-quality output including tool calling.

Hiding the fact that active params is _not_ 17B.


It doesn't have to be a 2-bit quant - see the update at the bottom of my post:

> Update: Dan's latest version upgrades to 4-bit quantization of the experts (209GB on disk, 4.36 tokens/second) after finding that the 2-bit version broke tool calling while 4-bit handles that well.

That was also just the first version of this pattern that I encountered, it's since seen a bunch of additional activity from other developers in other projects.

I linked to some of those in this follow-up: https://simonwillison.net/2026/Mar/24/streaming-experts/


On Apple Silicon Macs, the RAM is shared. So while maybe not up to raw GPU VRAM speeds, it still manages over 450GB/s real world on M4 Pro/Max series, to any place that it is needed.

They all do have a limitation from the SSD, but Apple's SSDs can do over 17GB/s (on high-end models; the more normal ones are around 8GB/s).


Yeah, I'm mostly just talking about the SSD bottleneck being too slow. No way Apple gets 17GB/s sustained; SSDs thermally throttle really fast, and there's some random access involved when the next expert is needed.

Unsloth often turn them around within a few hours, they might have gone to bed already though!

Keep an eye on https://huggingface.co/unsloth/models

Update ten minutes later: https://huggingface.co/unsloth/DeepSeek-V4-Pro just appeared but doesn't have files in yet, so they are clearly awake and pushing updates.



Those are quants, not distills.

The Flash one should - it's 160GB on Hugging Face: https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash/tree/ma...

So, dual RTX PRO 6000

I like the pelican I got out of deepseek-v4-flash more than the one I got from deepseek-v4-pro.

https://simonwillison.net/2026/Apr/24/deepseek-v4/

Both generated using OpenRouter.

For comparison, here's what I got from DeepSeek 3.2 back in December: https://simonwillison.net/2025/Dec/1/deepseek-v32/

And DeepSeek 3.1 in August: https://simonwillison.net/2025/Aug/22/deepseek-31/

And DeepSeek v3-0324 in March last year: https://simonwillison.net/2025/Mar/24/deepseek/


No way. The Pro pelican is fatter, has a customized front fork, and the sun is shining! He’s definitely living the best life.

The pro pelican is a work of art! It goes to dimensions no other LLM has gone before.

yeah. look at these 4 feathers (?) on his bum too.

a lot of dumplings

The Flash one is pretty impressive. Might be my favorite so far in the pelican-riding-a-bicycle series

This is just a random thought, but have you tried doing an 'agentic' pelican?

As in have the model consider its generated SVG, and gradually refine it, using its knowledge of the relative positions and proportions of the shapes generated, and have it spin for a while, and hopefully the end result will be better than just oneshotting it.

Or maybe going even one step further - most modern models have tool use and image recognition capabilities - what if you have it generate an SVG (or parts/layers of it, as per the model's discretion) and feed it back to itself via image recognition, and then improve on the result.

I think it'd be interesting to see, as for a lot of models their one-shot coding ability is not necessarily correlated with their in-harness ability, the latter of which is what really matters.
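The refinement loop described above might look something like this sketch, where `generate_svg`, `render_png`, and `critique_image` are hypothetical placeholders for the model and rasterizer calls, not a real API:

```python
def refine_svg(prompt, generate_svg, render_png, critique_image, rounds=4):
    """Iteratively refine an SVG by rendering it and feeding the image back."""
    svg = generate_svg(prompt)
    for _ in range(rounds):
        png = render_png(svg)                   # rasterize the current attempt
        feedback = critique_image(png, prompt)  # the model inspects its own output
        if feedback is None:                    # satisfied: stop early
            break
        svg = generate_svg(f"{prompt}\nPrevious attempt had issues: {feedback}")
    return svg
```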


I tried that for the GPT-5 launch - a self-improving loop that renders the SVG, looks at it and tries again - and the results were surprisingly disappointing.

I should try it again with the more recent models.


I see, thanks. I guess most current models are not yet trained for this loop.

Could you please try with Opus 4.7? I think there's a chance of it doing well, considering the design/vision focus.


DeepSeek pelicans are the angriest pelicans I’ve seen so far.

they're just late for work.

They're stressed pelicans from Hangzhou.

996 Pelican, lol

Being a bicycle geometry nerd I always look at the bicycle first.

Let me tell you how much the Pro one sucks... It looks like a failed Pedersen[1]. The rear wheel intersects with the bottom bracket, so it wouldn't even roll. Or rather, this bike couldn't exist.

The Flash one looks surprisingly correct, with some wild fork offset and the slackest of seat tubes. It's got some lowrider[2] aspirations with the small wheels, but with longer, Rivendellish[3], chainstays. The seat post has a different angle than the seat tube, so good luck lowering that.

[1] https://en.wikipedia.org/wiki/Pedersen_bicycle

[2] https://en.wikipedia.org/wiki/Lowrider_bicycle

[3] https://www.rivbike.com/


This is an excellent comment. Thanks for this - I've only ever thought about whether the frame is the right shape, I never thought about how different illustrations might map to different bicycle categories.

Some other reactions:

I wonder which model will try some more common spoke lacing patterns. Right now there seems to be a preference for radial lacing, which is not super common (but simple to draw). The Flash and Pro ones use 16-spoke rims, which actually exist[1] but are not super common.

The Pro model fails badly at the spokes. Heck, the spokes sit on the outside of the drive side of the rim and tire. Have fun riding on the spokes (instead of the tire) welded to the side of your rim.

Both bikes have the drive side on the left, which is very very uncommon. That can't exist in the training data.

[1] https://cicli-berlinetta.com/product/campagnolo-shamal-16-sp...


The Pedersen looks like someone failed the "draw a bicycle" test and decided to adjust the universe.

I think the pelican on a bike is known widely enough that it ceases to be useful as a benchmark. There's even a pelican briefly appearing in the promo video for GPT-5, if I'm not mistaken: https://openai.com/gpt-5/. So the companies are apparently aware of it.

It was a bigger deal in the Gemini 3.1 launch: https://x.com/JeffDean/status/2024525132266688757

To me this is the perfect proof that

1) LLMs are not AGI, because surely AGI would imply that Pro does better than Flash?

2) and because of the above, the pelican example is most likely already being benchmaxxed.


What was your prompt for the image? Apologies if this should be obvious.

>Generate an SVG of a pelican riding a bicycle

at the top of the linked pages.


Is it then DeepSeek hosted by DeepSeek?

How much does the drawing change if you ask it again?


I really like the pro version. The pelican is so cute.

Where is the GPT 5.5 Pelican?


In the 5.5 topic

Why they so angry?

[flagged]


It's just Simon Willison (the person you are replying to) who always makes a pelican, as his personal flippant benchmark. It's not that deep.

No benchmark will be perfect, especially if it's public but it's a fun experiment to visually see how these models get better and better.

Why is it so wrong?

Thanks for the "scientific air" remark, that gave me a genuine LOL.

"The difference between screwing around and science is writing it down" -- Adam Savage

This should not be the top comment on every model release post. It's getting tiring.

This should be the bottom comment on the pelican comment on every model release post.

Clearly the top comment should be "Imagine a beowulf cluster of Deepseek v4!"

My mother was murdered by Beowulf, you insensitive Claude!

This was perfect.

This doesn't have API access yet, but OpenAI seem to approve of the Codex API backdoor used by OpenClaw these days... https://twitter.com/steipete/status/2046775849769148838 and https://twitter.com/romainhuet/status/2038699202834841962

And that backdoor API has GPT-5.5.

So here's a pelican: https://simonwillison.net/2026/Apr/23/gpt-5-5/#and-some-peli...

I used this new plugin for LLM: https://github.com/simonw/llm-openai-via-codex

UPDATE: I got a much better pelican by setting the reasoning effort to xhigh: https://gist.github.com/simonw/a6168e4165a258e4d664aeae8e602...


OpenAI hired the guy behind OpenClaw, so it makes sense that they’re more lenient towards its usage.

They basically bought OpenClaw right?

I believe the technical term is "acquihire"

That pelican you posted yesterday from a local model looks nicer than this one.

Edit: this one has crossed legs lol


It really needs to pee.

Isn't it awful? After 5.5 versions it still can't draw a basic bike frame. How is the front wheel supposed to turn sideways?

I feel like if I attempted this, the bike frame would look fine and everything else would be completely unrecognizable. After all, a basic bike frame is just straight lines arranged in a fairly simple shape. It's really surprising that models find it so difficult, but they can make a pelican with panache.

> a fairly simple shape

Bike frames are very hard to draw unless you've already consciously internalized the basic shape, see https://www.booooooom.com/2016/05/09/bicycles-built-based-on...


Humans are also famously bad at drawing bicycles from memory https://www.gianlucagimini.it/portfolio-item/velocipedia/

why do you find it surprising? these models have no actual understanding of anything, never mind the physical properties and capabilities of a bicycle.

Sad to see this downvoted. Do so many people really think that LLMs have understanding?

My question is: as a human, how well would you or I do under the same conditions? I could do a much better job in Inkscape with Google Images to back me up, but if I were blindly shitting vectors into an XML file that I can't render to see the results of, I'm not even going to get the triangles of the frame to line up. So this pelican is very impressive!

Yeah, the bike frame is the thing I always look at first - it's still reasonably rare for a model to draw that correctly, although Qwen 3.6 and Gemini Pro 3.1 do that well now.

The distinction is that it's not drawing. It's generating an SVG document containing descriptors of the shapes.

I made pelicans at different thinking efforts:

https://hcker.news/pelican-low.svg

https://hcker.news/pelican-medium.svg

https://hcker.news/pelican-high.svg

https://hcker.news/pelican-xhigh.svg

Someone needs to make a pelican arena, I have no idea if these are considered good or not.


They are not good, and they seem to get worse as you increased effort. Weird

Yeah. I've always loosely correlated pelican quality with big model smell but I'm not picking that up here. I thought this was supposed to be spud? Weird indeed.

No but I can sense the movement, I think it's already reached the level of intelligence that draws it towards futurism or cubism /s

Can someone explain how we arrived at the pelican test? Was there some actual theory behind why it's difficult to produce? Or did someone just think it up, discover it was consistently difficult, and now we just all know it's a good test?

I set it up as a joke, to make fun of all of the other benchmarks. To my surprise it ended up being a decent measure of a model's quality at other tasks (up to a certain point at least), though I've never seen a convincing argument as to why.

I gave a talk about it last year: https://simonwillison.net/2025/Jun/6/six-months-in-llms/

It should not be treated as a serious benchmark.


What it has going for it is human interpretability.

Anyone can look and decide if it’s a good picture or not. But the numeric benchmarks don’t tell you much if you aren’t already familiar with that benchmark and how it’s constructed.


how can you say "it ended up being a surprisingly good measure of the quality of the model for other tasks" and also "It should not be treated as a serious benchmark" in the same comment?

if it is indeed a good measure of the quality of the model (hint: it's not) then, logically, it should be taken seriously.

this is, sadly, a great example of the kind of doublethink the "AI" hypesters (yes - whether you like it or not simon - that is what you are now) are all too capable of.


I genuinely don't see how those two statements conflict with each other.

Despite not being a serious benchmark (how could it be serious? It's a pelican riding a bicycle!) it still turned out to have some value. You can see that just by scrolling through the archives and watching it improve as the models improved.

If your definition of doublethink is "holding two conflicting ideas in your head at once" then I would say doublethink is a necessary skill for navigating the weird AI era we find ourselves inhabiting.


"some value" is not the same as "a surprisingly good measure of the quality of the model for other tasks".

doublethink does not mean holding two conflicting ideas in your head at once. it means holding two logically inconsistent positions/beliefs at the same time.


It all began with a Microsoft researcher showing a unicorn drawn in TikZ using GPT-4. It was an example of something so outrageous that there was no way it existed in the training data. And that was back when models were not multimodal.

Nowadays I think it's pretty silly, because there's surely SVG drawing training data and some effort from the researchers put onto this task. It's not a showcase of emergent properties.



It's interesting to see some semblance of spatial reasoning emerge from systems based on textual tokens. Could be seen as a potential proxy for other desirable traits.

It's meta-interesting that few if any models actually seem to be training on it. Same with other stereotypical challenges like the car-wash question, which is still sometimes failed by high-end models.

If I ran an AI lab, I'd take it as a personal affront if my model emitted a malformed pelican or advised walking to a car wash. Heads would roll.


I tried getting it to generate openscad models, which seems much harder. Not had much joy yet with results.

G-code and ASCII art are also text formats, but seem to be beyond most if not all models.

(There are some that generate 3d models specifically, more in the image generation family than chatbot family.)


None of them have the pelican's feet placed properly on the pedals -- or the pedals are misrepresented. Cool art style but not physically accurate.

I'm not sure a physically accurate pelican would reach two pedals on a common bicycle. Maybe a model can solve that problem one day.


It's... like no pelican I've ever seen before.

You've never seen pelicans riding bicycles either so maybe these are just representations of those specific subgroups of pelicans which are capable of riding them. Normal pelicans would not feel the need to ride bikes since they can fly, these special pelicans mostly seem to lack the equipment needed to do that which might be part of the reason they evolved to ride two-wheeled pedal-propelled vehicles.

The pelican doesn’t really matter anymore since models are tuned for it knowing people will ask.

They suck at tuning for it.

Is this direct API usage allowed by their terms? I remember Anthropic really not liking such usage.


It's amazing that the default did that much in just 39 "reasoning tokens" (no idea what a reasoning token is, but that's still shockingly few tokens)

If you don't know what a reasoning token is, then how can 39 be considered shockingly few?

It's less than 67, duh.

Not during peak hours.

Hmm. Any idea why it's so much worse than the other ones you have posted lately? Even the open weight local models were much better, like the Qwen one you posted yesterday.

The xhigh one was better, but clearly OpenAI have not been focusing their training efforts on SVG illustrations of animals riding modes of transport!

It beats opus-4.7 but looks like open models actually have the lead here.

So pelican must have become the mandatory test case to pass for all model providers before launch.

Thank you for doing all this. It's appreciated.

You do realise they are doing it for self promotion right?

I mean, yeah. "Person who spends time publishing content online is doing it for self promotion" doesn't seem particularly notable to me. 24 years of self promotion and counting!

Dude it comes across, maybe only to me, as a bit shameless. Or maybe it's just that there are so many people lapping it up like you're doing a public service that I find tedious. I wish hackernews had a block feature but alas it doesn't. Maybe I'll vibecode a browser extension.

I am always outraged when youtube creators ask me to like and subscribe. /s

Not the same at all. For that to happen you would have to explicitly visit their channel (forgive incorrect terminology, I don't use YouTube). If someone kept posting on Hacker News asking you to subscribe, I hope you wouldn't appreciate it. simonw is spamming a communal public feed with self-promotional comments about vibe coding, quite obviously because they, like the rest of us, are panicking about not having a career in a few years.

The more time I spend actually working with these tools the less I fear for my future career.

Building software remains really hard. Most people are not going to be able to produce production quality software systems, no matter how good the AI tooling gets.


Conversely, if the models ever make it to the point where they can replace ~all developers we will presumably have achieved AGI or even ASI and all other jobs will also be eliminated more or less simultaneously. So at least we'll all be in good company (and there probably won't be much point to marketing yourself in that case).

Forums traditionally included signature blocks at the end of messages. If someone linked his youtube channel there would that be objectionable? Assuming the preceding message was on point of course.

Posts on HN are analogous to videos on youtube. A channel is analogous to an HN user profile.


Wait, I thought we were onto racoons on e-scooters to avoid (some of) the issues with Goodhart's Law coming into play.

I fall back to possums on e-scooters if the pelican looks too good to be true. These aren't good enough for me to suspect any fowl play.

What is your setup for drawing pelicans? Do you ask the model to check the generated image, find issues, and iterate on it, which would demonstrate the model's real abilities?

It's generally one-shot-only - whatever comes out the first time is what I go with.

I've been contemplating a more fair version where each model gets 3-5 attempts and then can select which rendered image is "best".
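That best-of-N idea could be sketched as (again with hypothetical placeholder functions for the model calls):

```python
def best_of_n(prompt, generate_svg, render_png, pick_best, n=5):
    """Generate n attempts, render each, and let the model pick its favorite."""
    attempts = [generate_svg(prompt) for _ in range(n)]
    images = [render_png(svg) for svg in attempts]
    return attempts[pick_best(images)]   # model returns the index of the best render
```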


Try llm-consortium with --judging-method rank

I think it will make results way better and more representative of model abilities.

It would... but the test is inherently silly, so I'm still not sure if it's worth me investing that extra effort in it.

I for one delight in bicycles where neither wheel can turn!

It continues to amaze me that these models, which definitely know what bicycle geometry actually looks like somewhere in their weights, produce such implausibly bad geometry.

Also mildly interesting, and generally consistent with my experience with LLMs, that it produced the same obvious geometry issue both times.


> It continues to amaze me that these models, which definitely know what bicycle geometry actually looks like somewhere in their weights, produce such implausibly bad geometry.

I feel like the main problem for the models is that they can't actually look at the visual output produced by their SVG and iterate. I'm almost willing to bet that if they could, they'd absolutely nail it at this point.

Imagine designing an SVG yourself without being able to ever look outside the XML editor!


> Imagine designing an SVG yourself without being able to ever look outside the XML editor!

I honestly think I could do much better on the bicycle without looking at the output (with some assistance for SVG syntax which I definitely don't know), just as someone who rides them and generally knows what the parts are.

I'd do worse at the pelicans though.


Thank you for continuing to post these! Very interesting benchmark.

Does OpenAI actually act open for once here, allowing use of their model via a subscription, unlike Anthropic, which banned that kind of use in OpenClaw?

That's what they said on Twitter.

Exciting. Another Pelican post.

It's silly and a joke and a surprisingly good benchmark and don't take it seriously but don't take not taking it seriously seriously and if it's too good we use another prompt and there's obvious ways to better it and it's not worth doing because it's not serious and if you say anything at all about the thread it's off-topic so you're doing exactly what you're complaining about and it's a personal attack from the fun police.

Only coherent move at this point: hit the minus button immediately. There's never anything about the model in the thread other than simon's post.


See if you can spot what's interesting and unique about this one. I've been trying to put more than just a pelican in there, partly as a nod to people who are getting bored of them.

At some point, OpenAI is going to cheat and hardcode a pelican on a bicycle into the model. 3D modelling has Suzanne and the teapot; LLMs will have the pelican.

You know they are 1000% training these models to draw pelicans; this hasn't been a valid benchmark for 6+ months.

OpenAI must be very bad at training models to draw pelicans (and bicycles) then.

Skepticism is out of control these days; any time an LLM does something cool, it must have been cheating.

they legitimately suck at everything they don't have concrete examples to copy from.

Fear of AI companies "slurping up data" being used as a rationale for not sharing anything is one of the most underrated harms of the whole current AI mess.

It’s not a fear. It’s reality. It’s literally happening on HN right now.

Take this game, for example: https://news.ycombinator.com/item?id=47698455

Within an hour, someone had cloned the game with additional mechanics that multiple people said they liked more: https://news.ycombinator.com/item?id=47729573


That's not an AI company "slurping up data", that's someone using AI tools to accelerate their own personal clone of a project.

I think you're missing the point. The game (no pun intended?) has changed. Working with the garage door up has become a liability.

Doesn't feel particularly different to me, I've been publishing my side projects as open source code on GitHub for over a decade.

The effort required to adapt them has dropped, but I've always exposed them to being adapted.


> Doesn't feel particularly different to me

> The effort required to adapt them has dropped

AI is an entirely different situation because the effort required to copy has dropped by multiple orders of magnitude. You used to be able to build in the open without worrying about copycats because the vast majority of people didn’t want to spend the effort. Now (with AI), even someone with the slightest, most fleeting whim can copy your work.

It’s great that you’re open to being adapted. There’s nothing wrong with that. But if you’re not open to having your ideas outright taken, then it’s not safe to build in the open any longer.


If I cared about people copying my projects and ideas I wouldn't put them on GitHub with a liberal open source license.

It has been known (especially in gamedev circles) that ideas are not worth much. I don't like AI slop, but what's the harm of taking someone's demo and making it better? Then someone else can do the same, and tweak some other mechanic.

no we got something better out of it

I read simonw's comment not as dismissing the reality, but rather highlighting the harm of discouraging sharing.

The slurping can be both real and the induced reluctance to share a harm.


Why is that a bad thing? Person 1 built a thing, and then someone came along and made it better? It's a game, so better is subjective, but should ideas only ever come from Person 1, while everyone else just gazes upon them with slack jawed awe, unable to contribute?

I completely agree. Honestly I wish we could go back to before AI. I don't like where it's taking us at all. Changing how we write code is just the beginning. Next we'll be replacing humans altogether. I've already had an interview with a soulless "AI recruiter" bot. We can't go back now of course, but one can dream.

It can act as an in-process database, like SQLite. You can import the library directly into your code.
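For illustration, here is the same in-process pattern with Python's built-in sqlite3 module: no server process, the database engine runs inside your program (in memory here; pass a file path for persistence):

```python
import sqlite3

# No server to run: the database engine is linked into the process.
conn = sqlite3.connect(":memory:")  # or a file path for a persistent database
conn.execute("CREATE TABLE papers (title TEXT, score INTEGER)")
conn.execute("INSERT INTO papers VALUES (?, ?)", ("LLM in a Flash", 42))
row = conn.execute("SELECT score FROM papers WHERE title = ?",
                   ("LLM in a Flash",)).fetchone()
print(row[0])  # 42
conn.close()
```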
