Datacenter storage will generally not be using M.2 client drives. Client drives employ optimizations that win many benchmarks but sacrifice consistency along multiple dimensions (power loss protection, write performance that degrades as they fill, perhaps others).
With SSDs, the write pattern is very important to read performance.
Datacenter and enterprise class drives tend to have a maximum transfer size of 128k, which is seemingly the NAND block size. A block is the thing that needs to be erased before rewriting.
Most drives seem to have an indirection unit size of 4k. If a write is not a multiple of the IU size or not aligned, the drive will have to do a read-modify-write. It is the IU size that is most relevant to filesystem block size.
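To make the read-modify-write condition concrete, a tiny sketch (Python, assuming the common 4k IU):

```python
IU = 4096  # assumed indirection unit size in bytes

def causes_rmw(offset: int, length: int, iu: int = IU) -> bool:
    """A write triggers read-modify-write unless it starts on an IU
    boundary and covers whole IUs."""
    return offset % iu != 0 or length % iu != 0

assert causes_rmw(4096, 512)       # sub-IU write: RMW
assert not causes_rmw(8192, 8192)  # aligned, whole IUs: no RMW
```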
If a small write lands atop a block that was fully written in one write, a read of that LBA range will require at least two NAND reads until garbage collection consolidates the data.
If all writes are 128k aligned, sequential reads will be optimal, and with sufficient queue depth random 128k reads may match sequential read speed. Depending on the drive, sequential reads may retain an edge due to the drive’s read-ahead. My own benchmarks of gen4 U.2 drives generally back up these statements.
At these speeds, having the OS or app perform buffered reads may reduce throughput, because cache management becomes relatively expensive. Testing should be done with direct IO using libaio or similar.
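Not libaio itself, but a minimal Linux sketch of a single uncached 128k read with O_DIRECT (the device path is hypothetical and needs permissions); a real benchmark would use fio or libaio/io_uring with a deep queue:

```python
import mmap
import os

BLOCK = 128 * 1024                      # one 128k transfer
fd = os.open("/dev/nvme0n1", os.O_RDONLY | os.O_DIRECT)
buf = mmap.mmap(-1, BLOCK)              # page-aligned buffer, as O_DIRECT requires
try:
    n = os.preadv(fd, [buf], 0)         # direct read at offset 0, bypasses page cache
    print(f"read {n} bytes")
finally:
    os.close(fd)
```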
I think the impact is bigger on writes than on reads, but it certainly means there is some gap from optimal.
To me a 4k read seems anachronistic from a modern application perspective, though I gather 4k pages are still common in many filesystems. That doesn’t mean the majority of reads are 4k random in a real-world scenario.
Without looking at the code, O(N * k) with N = 9000 points and k = 50 dimensions should take on the order of milliseconds, not seconds. Did you profile your code to see whether there is perhaps something that takes an unexpected amount of time?
The '2 seconds' figure comes from the end-to-end time on a standard laptop. I quoted 2s to set realistic expectations for the user experience, not the CPU cycle count. You are right that the core linear algebra (Ax=b) is milliseconds; the bottleneck is the DOM/rendering overhead. Strictly speaking, the math itself is blazing fast.
That’s not how big-O notation works. You don’t know what proportionality constants are being hidden by the notation, so you can’t make any assertions about absolute runtimes.
It is true that big-O notation does not necessarily tell you anything about the actual runtime, but if the hidden constant appears suspiciously large, one should double-check whether something else is going on.
Each of the N data points is processed through several expensive linear algebra operations. O(N * k) just expresses that if you double N, the runtime also at most doubles. It doesn't mean it has to be fast in an absolute sense for any particular value of N and k.
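A toy illustration of the hidden-constant point, in Python with NumPy (numbers from the thread: N = 9000, k = 50). Both versions below do O(N * k) work, yet their absolute runtimes differ by orders of magnitude:

```python
import time
import numpy as np

N, k = 9000, 50
X = np.random.rand(N, k)

t0 = time.perf_counter()
s_vec = X.sum()                 # O(N * k), but a tight C loop
t1 = time.perf_counter()

s_loop = 0.0
for i in range(N):              # also O(N * k), but interpreted Python
    for j in range(k):
        s_loop += X[i, j]
t2 = time.perf_counter()

print(f"vectorized: {(t1 - t0) * 1e3:.2f} ms, python loop: {(t2 - t1) * 1e3:.2f} ms")
```

Same asymptotic class, very different constants, so a surprisingly slow run warrants profiling rather than a complexity argument.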
I can’t say how the big ML companies do it, but from personal experience training vision models, you can absolutely reuse the weights of barely related architectures (add more layers, switch between different normalization layers, switch between separable/full convolutions, change activation functions, etc.). Even if the shapes of the weights do not match, just do what you have to do to make them fit (repeat or crop). Of course the models will not work right away, but training will go much faster. I usually get over 10 times faster convergence that way.
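Not an exact recipe, just a minimal sketch of the "repeat or crop" idea in PyTorch; fit_weight is a made-up helper name:

```python
import torch

def fit_weight(old: torch.Tensor, new_shape) -> torch.Tensor:
    """Tile and/or crop a pretrained tensor to match a new layer's shape."""
    assert old.dim() == len(new_shape), "handles same-rank tensors only"
    out = old
    for dim, target in enumerate(new_shape):
        if out.shape[dim] < target:
            reps = [1] * out.dim()
            reps[dim] = -(-target // out.shape[dim])  # ceil division
            out = out.repeat(*reps)                   # repeat along this dim
        out = out.narrow(dim, 0, target)              # crop down to target
    return out.contiguous().clone()

# e.g. initialize a wider conv from a narrower pretrained one:
# new_conv.weight.data.copy_(fit_weight(old_conv.weight.data, new_conv.weight.shape))
```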
It’s possible the model architecture influences how effective reusing pretrained weights is. I.e. CNNs might be a good fit for this, since the first portion is the feature extractor; you might scrap the decoder and simply retrain that.
Can’t say whether the same would work with the Transformer architecture, but I would guess there are some portions that could potentially be reused? (There still exists an encoder/feature-extraction portion.)
If you’re reusing weights from an existing model, then it seems it becomes more of a “fine-tuning” exercise as opposed to training a novel foundational model.
Why would the open weights providers need their own tools for agentic workflows when you can just plug their OpenAI-compatible API URL into existing tools?
> when you can just plug their OpenAI-compatible API URL into existing tools?
Only the self-hosting diehards will bother with that. Those who want to compete with Claude Code, Gemini CLI, Codex et caterva will have to provide the whole package and do it at a price point that is competitive even at low volumes - which is hard to do because the big LLM providers are all subsidizing their offerings.
You need a certain level of batch parallelism to make inference efficient, but you also need enough capacity to handle request floods. Being a small provider is not easy.
I just tried it with GPT-5.1-Codex. The compression ratio is not amazing, so I’m not sure if it really worked, but at least it ran without errors.
A few ideas for how to make it work for you:
1. You gave a link to a PDF, but you did not describe how you provided the content of the PDF to the model. It might only have read the text with something like pdftotext, which for this PDF results in a garbled mess. It is safer to convert the pages to PNG (e.g. with pdftoppm, see the sketch after this list) and let the model read it from the pages. A prompt like "Transcribe these pages as markdown." should be sufficient. If you cannot see what the model did, there is a chance it made things up.
2. You used C++, but Python is much easier to write. You can tell the model to translate the code to C++ once it works in Python.
3. Tell the model to write unit tests to verify that the individual components work as intended.
4. Use Agent Mode and tell the model to print something and to judge whether the output is sensible, so it can debug the code.
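For item 1, a minimal sketch in Python, assuming pdftoppm from poppler-utils is installed; the function name and paths are my own:

```python
import subprocess
from pathlib import Path

def pdf_to_pngs(pdf_path: str, out_dir: str = "pages", dpi: int = 150):
    """Render every page of the PDF to a PNG via pdftoppm (poppler-utils)."""
    Path(out_dir).mkdir(exist_ok=True)
    # Writes one PNG per page under out_dir/, named page-<N>.png
    # (page numbers are zero-padded for longer documents).
    subprocess.run(
        ["pdftoppm", "-png", "-r", str(dpi), pdf_path, f"{out_dir}/page"],
        check=True,
    )
    return sorted(Path(out_dir).glob("page-*.png"))
```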
> I do wonder if there are any DOS vectors that need to be considered if such a large image can be defined in relatively small byte space.
You can already DOS with SVG images. Usually, the browser tab crashes before worse things happen. Most sites therefore do not allow SVG uploads, except GitHub for some reason.
SVG is also just kind of annoying to deal with, because the image may or may not even have a size, and if it does, it can be specified in a bunch of different units. That makes it a lot harder if you want to store the size of the image or use it anywhere in your code.
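To show what that looks like in practice, a best-effort sketch in Python (standard library only; svg_size_px is a hypothetical helper, and it ignores the viewBox fallback a real implementation would also need):

```python
import re
import xml.etree.ElementTree as ET

# CSS absolute units at 96 px/in; em/ex/% have no intrinsic pixel size.
UNIT_TO_PX = {"": 1.0, "px": 1.0, "pt": 96 / 72, "pc": 16.0,
              "in": 96.0, "cm": 96 / 2.54, "mm": 96 / 25.4}

def svg_size_px(path):
    """Best-effort (width, height) in pixels; None where undeterminable."""
    root = ET.parse(path).getroot()

    def parse(attr):
        val = root.get(attr)
        if val is None:
            return None  # the attribute may simply be absent
        m = re.fullmatch(r"([0-9.]+)\s*([a-z%]*)", val.strip())
        if m is None or m.group(2) not in UNIT_TO_PX:
            return None  # e.g. "50%" or "10em": no intrinsic size
        return float(m.group(1)) * UNIT_TO_PX[m.group(2)]

    return parse("width"), parse("height")
```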
Could you explain a bit how the code works? For example, how does it detect the correct pixel size and how does it find out how to color the (potentially misaligned) pixels?
https://i.imgur.com/t5scCa3.png
https://ssd.userbenchmark.com/ (click on the orange double arrow to view additional columns)
That is a latency of about 50 µs for a random read, compared to 4-5 ms latency for HDDs.