Datacenter storage will generally not be using M.2 client drives. Client drives employ optimizations that win many benchmarks but sacrifice consistency along multiple dimensions (power loss protection, write performance that degrades as they fill, perhaps others).
With SSDs, the write pattern is very important to read performance.
Datacenter and enterprise class drives tend to have a maximum transfer size of 128k, which is seemingly the NAND block size. A block is the thing that needs to be erased before rewriting.
Most drives seem to have an indirection unit size of 4k. If a write is not a multiple of the IU size or not aligned, the drive will have to do a read-modify-write. It is the IU size that is most relevant to filesystem block size.
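To make the read-modify-write condition concrete, a tiny sketch (Python, assuming the common 4k IU):

```python
IU = 4096  # assumed indirection unit size in bytes

def causes_rmw(offset: int, length: int, iu: int = IU) -> bool:
    """A write triggers read-modify-write unless it starts on an IU
    boundary and covers whole IUs."""
    return offset % iu != 0 or length % iu != 0

assert causes_rmw(4096, 512)       # sub-IU write: RMW
assert not causes_rmw(8192, 8192)  # aligned, whole IUs: no RMW
```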
If a small write lands atop a block that was fully written in one write, a read of that LBA range will require at least two NAND reads until garbage collection consolidates the data.
If all writes are 128k aligned, sequential reads will be optimal, and with sufficient queue depth random 128k reads may match sequential read speed. Depending on the drive, sequential reads may retain an edge due to the drive’s read-ahead. My own benchmarks of gen4 U.2 drives generally back up these statements.
At these speeds, having the OS or app perform buffered reads may reduce throughput, because cache management becomes relatively expensive. Testing should be done with direct IO using libaio or similar.
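Not libaio itself, but a minimal Linux sketch of a single uncached 128k read with O_DIRECT (the device path is hypothetical and needs permissions); a real benchmark would use fio or libaio/io_uring with a deep queue:

```python
import mmap
import os

BLOCK = 128 * 1024                      # one 128k transfer
fd = os.open("/dev/nvme0n1", os.O_RDONLY | os.O_DIRECT)
buf = mmap.mmap(-1, BLOCK)              # page-aligned buffer, as O_DIRECT requires
try:
    n = os.preadv(fd, [buf], 0)         # direct read at offset 0, bypasses page cache
    print(f"read {n} bytes")
finally:
    os.close(fd)
```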
I think the impact is bigger on writes than on reads, but it certainly means there is some gap from optimal.
To me a 4k read seems anachronistic from a modern application perspective, though I gather 4k pages are still common in many filesystems. That doesn’t mean the majority of reads are 4k random in a real-world scenario.
Without looking at the code, O(N * k) with N = 9000 points and k = 50 dimensions should take on the order of milliseconds, not seconds. Did you profile your code to see whether there is perhaps something that takes an unexpected amount of time?
The '2 seconds' figure comes from the end-to-end time on a standard laptop. I quoted 2s to set realistic expectations for the user experience, not the CPU cycle count. You are right that the core linear algebra (Ax=b) is milliseconds; the bottleneck is the DOM/rendering overhead. Strictly speaking, the math itself is blazing fast.
That’s not how big-O notation works. You don’t know what proportionality constants are being hidden by the notation, so you can’t make any assertions about absolute runtimes.
It is true that big-O notation does not necessarily tell you anything about the actual runtime, but if the hidden constant appears suspiciously large, one should double-check whether something else is going on.
Each of the N data points is processed through several expensive linear algebra operations. O(N * k) just expresses that if you double N, the runtime also at most doubles. It doesn't mean it has to be fast in an absolute sense for any particular value of N and k.
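A toy illustration of the hidden-constant point, in Python with NumPy (numbers from the thread: N = 9000, k = 50). Both versions below do O(N * k) work, yet their absolute runtimes differ by orders of magnitude:

```python
import time
import numpy as np

N, k = 9000, 50
X = np.random.rand(N, k)

t0 = time.perf_counter()
s_vec = X.sum()                 # O(N * k), but a tight C loop
t1 = time.perf_counter()

s_loop = 0.0
for i in range(N):              # also O(N * k), but interpreted Python
    for j in range(k):
        s_loop += X[i, j]
t2 = time.perf_counter()

print(f"vectorized: {(t1 - t0) * 1e3:.2f} ms, python loop: {(t2 - t1) * 1e3:.2f} ms")
```

Same asymptotic class, very different constants, so a surprisingly slow run warrants profiling rather than a complexity argument.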
I can’t say how the big ML companies do it, but from personal experience training vision models, you can absolutely reuse the weights of barely related architectures (add more layers, switch between different normalization layers, switch between separable/full convolutions, change activation functions, etc.). Even if the shapes of the weights do not match, just do what you have to do to make them fit (repeat or crop). Of course the models will not work right away, but training will go much faster. I usually get over 10 times faster convergence that way.
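Not an exact recipe, just a minimal sketch of the "repeat or crop" idea in PyTorch; fit_weight is a made-up helper name:

```python
import torch

def fit_weight(old: torch.Tensor, new_shape) -> torch.Tensor:
    """Tile and/or crop a pretrained tensor to match a new layer's shape."""
    assert old.dim() == len(new_shape), "handles same-rank tensors only"
    out = old
    for dim, target in enumerate(new_shape):
        if out.shape[dim] < target:
            reps = [1] * out.dim()
            reps[dim] = -(-target // out.shape[dim])  # ceil division
            out = out.repeat(*reps)                   # repeat along this dim
        out = out.narrow(dim, 0, target)              # crop down to target
    return out.contiguous().clone()

# e.g. initialize a wider conv from a narrower pretrained one:
# new_conv.weight.data.copy_(fit_weight(old_conv.weight.data, new_conv.weight.shape))
```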
It’s possible the model architecture influences how effective reusing pretrained weights is. I.e. CNNs might be a good fit for this, since the first portion is the feature extractor; you might scrap the decoder and simply retrain that.
Can’t say whether the same would work with the Transformer architecture, but I would guess there are some portions that could potentially be reused? (There still exists an encoder/feature-extraction portion.)
If you’re reusing weights from an existing model, then it seems it becomes more of a “fine-tuning” exercise as opposed to training a novel foundational model.
Why would the open weights providers need their own tools for agentic workflows when you can just plug their OpenAI-compatible API URL into existing tools?
> when you can just plug their OpenAI-compatible API URL into existing tools?
Only the self-hosting diehards will bother with that. Those who want to compete with Claude Code, Gemini CLI, Codex et caterva will have to provide the whole package and do it at a price point that is competitive even at low volumes - which is hard to do because the big LLM providers are all subsidizing their offerings.
You need a certain level of batch parallelism to make inference efficient, but you also need enough capacity to handle request floods. Being a small provider is not easy.
I just tried it with GPT-5.1-Codex. The compression ratio is not amazing, so I’m not sure if it really worked, but at least it ran without errors.
A few ideas for how to make it work for you:
1. You gave a link to a PDF, but you did not describe how you provided the content of the PDF to the model. It might only have read the text with something like pdftotext, which for this PDF results in a garbled mess. It is safer to convert the pages to PNG (e.g. with pdftoppm, see the sketch after this list) and let the model read it from the pages. A prompt like "Transcribe these pages as markdown." should be sufficient. If you cannot see what the model did, there is a chance it made things up.
2. You used C++, but Python is much easier to write. You can tell the model to translate the code to C++ once it works in Python.
3. Tell the model to write unit tests to verify that the individual components work as intended.
4. Use Agent Mode and tell the model to print something and to judge whether the output is sensible, so it can debug the code.
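For item 1, a minimal sketch in Python, assuming pdftoppm from poppler-utils is installed; the function name and paths are my own:

```python
import subprocess
from pathlib import Path

def pdf_to_pngs(pdf_path: str, out_dir: str = "pages", dpi: int = 150):
    """Render every page of the PDF to a PNG via pdftoppm (poppler-utils)."""
    Path(out_dir).mkdir(exist_ok=True)
    # Writes one PNG per page under out_dir/, named page-<N>.png
    # (page numbers are zero-padded for longer documents).
    subprocess.run(
        ["pdftoppm", "-png", "-r", str(dpi), pdf_path, f"{out_dir}/page"],
        check=True,
    )
    return sorted(Path(out_dir).glob("page-*.png"))
```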
> I do wonder if there are any DOS vectors that need to be considered if such a large image can be defined in relatively small byte space.
You can already DOS with SVG images. Usually, the browser tab crashes before worse things happen. Most sites therefore do not allow SVG uploads, except GitHub for some reason.
SVG is also just kind of annoying to deal with, because the image may or may not even have a size, and if it does, it can be specified in a bunch of different units. That makes it a lot harder if you want to store the size of the image or use it anywhere in your code.
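To show what that looks like in practice, a best-effort sketch in Python (standard library only; svg_size_px is a hypothetical helper, and it ignores the viewBox fallback a real implementation would also need):

```python
import re
import xml.etree.ElementTree as ET

# CSS absolute units at 96 px/in; em/ex/% have no intrinsic pixel size.
UNIT_TO_PX = {"": 1.0, "px": 1.0, "pt": 96 / 72, "pc": 16.0,
              "in": 96.0, "cm": 96 / 2.54, "mm": 96 / 25.4}

def svg_size_px(path):
    """Best-effort (width, height) in pixels; None where undeterminable."""
    root = ET.parse(path).getroot()

    def parse(attr):
        val = root.get(attr)
        if val is None:
            return None  # the attribute may simply be absent
        m = re.fullmatch(r"([0-9.]+)\s*([a-z%]*)", val.strip())
        if m is None or m.group(2) not in UNIT_TO_PX:
            return None  # e.g. "50%" or "10em": no intrinsic size
        return float(m.group(1)) * UNIT_TO_PX[m.group(2)]

    return parse("width"), parse("height")
```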
Could you explain a bit how the code works? For example, how does it detect the correct pixel size and how does it find out how to color the (potentially misaligned) pixels?
https://i.imgur.com/t5scCa3.png
https://ssd.userbenchmark.com/ (click on the orange double arrow to view additional columns)
That is a latency of about 50 µs for a random read, compared to 4-5 ms latency for HDDs.