For the best experience on desktop, install the Chrome extension to track your reading on news.ycombinator.com
Hacker Newsnew | past | comments | ask | show | jobs | submit | history | more threeducks's commentsregister

Lets take the Samsung 9100 Pro M.2 as an example. It has a sequential read rate of ~6700 MB/s and a 4k random read rate of ~80 MB/s:

https://i.imgur.com/t5scCa3.png

https://ssd.userbenchmark.com/ (click on the orange double arrow to view additional columns)

That is a latency of about 50 µs for a random read, compared to 4-5 ms latency for HDDs.


Datacenter storage will generally not be using M.2 client drives. They employ optimizations that win many benchmarks but sacrifice on consistency multiple dimensions (power loss protection, write performance degrades as they fill, perhaps others).

With SSDs, the write pattern is very important to read performance.

Datacenter and enterprise class drives tend to have a maximum transfer size of 128k, which is seemingly the NAND block size. A block is the thing that needs to be erased before rewriting.

Most drives seem to have an indirection unit size of 4k. If a write is not a multiple of the IU size or not aligned, the drive will have to do a read-modify-write. It is the IU size that is most relevant to filesystem block size.

If a small write happens atop a block that was fully written with one write, a read of that LBA range will lead to at least two NAND reads until garbage collection fixes it.

If all writes are done such that they are 128k aligned, sequential reads will be optimal and with sufficient queue depth random 128k reads may match sequential read speed. Depending on the drive, sequential reads may retain an edge due to the drive’s read ahead. My own benchmarks of gen4 U.2 drives generally backs up these statements.

At these speeds, the OS or app performing buffered reads may lead to reduced speed because cache management becomes relatively expensive. Testing should be done with direct IO using libaio or similar.


At the 4K random reads impacted by the fact that you still cannot switch Samsung SSDs to 4K native clusters?


I think that is a bigger impact on writes than reads, but certainly means there is some gap from optimal.

To me a 4k read seems anachronistic from a modern application perspective. But I gather 4kb pages are still common in many file systems. But that doesn’t mean the majority of reads are 4kb random in a real world scenario.


That’s literally faster to do a full table scan below a particular table size.


Without looking at the code, O(N * k) with N = 9000 points and k = 50 dimensions should take in the order of milliseconds, not seconds. Did you profile your code to see whether there is perhaps something that takes an unexpected amount of time?


The '2 seconds' figure comes from the end-to-end time on a standard laptop. I quoted 2s to set realistic expectations for the user experience, not the CPU cycle count. You are right that the core linear algebra (Ax=b) is milliseconds. The bottleneck is the DOM/rendering overhead, but strictly speaking, the math itself is blazing fast.


This is great Roman, congrats on the amazing work :)!


If he wrote the for loop in python instead of numpy or C or whatever it could be a plausible runtime.


Thats not how big-O notation works. You don’t know what proportionality constants are being hidden by the notation so you cant make any assertions about absolute runtimes


It is true that big-O notation does not necessarily tell you anything about the actual runtime, but if the hidden constant appears suspiciously large, one should double-check whether something else is going on.


Each of the N data points is processed through several expensive linear algebra operations. O(N * k) just expresses that if you double N, the runtime also at most doubles. It doesn't mean it has to be fast in an absolute sense for any particular value of N and k.


Didn't read TFA, but it's hard to think of a linear algebra operation that is both that slow and takes time independent of n and k.


Tail calls can be used for parsing very efficiently: https://news.ycombinator.com/item?id=41289114


I can not say how big ML companies do it, but from personal experience of training vision models, you can absolutely reuse the weights of barely related architectures (add more layers, switch between different normalization layers, switch between separable/full convolution, change activation functions, etc.). Even if the shapes of the weights do not match, just do what you have to do to make them fit (repeat or crop). Of course the models will not work right away, but training will go much faster. I usually get over 10 times faster convergence that way.


It’s possible the model architecture influences the effectiveness of utilizing pretrained weights. i.e. cnns might be a good fit for this since the first portion is the feature extractor, but you might scrap the decoder and simply retrain that.

Can’t say whether the same would work with Transformer architecture, but I would guess there are some portions that could potentially be reused? (there still exists an encoder/feature extraction portion)

If you’re reusing weights from an existing model, then it seems it becomes more of a “fine-tuning” exercise as opposed to training a novel foundational model.


Why would the open weights providers need their own tools for agentic workflows when you can just plug their OpenAI-compatible API URL into existing tools?

Also, there are many providers of open source models with caching (Moonshot AI, Groq, DeepSeek, FireWorks AI, MiniMax): https://openrouter.ai/docs/guides/best-practices/prompt-cach...


> when you can just plug their OpenAI-compatible API URL into existing tools?

Only the self-hosting diehards will bother with that. Those that want to compete with Claude Code, Gemini CLI, Codex et caterva will have to provide the whole package and do it a price point that is competitive even with low volumes - which is hard to do because the big LLM providers are all subsidizing their offerings.


You need a certain level of batch parallelism to make inference efficient, but you also need enough capacity to handle request floods. Being a small provider is not easy.


I just tried it with GPT-5.1-Codex. The compression ratio is not amazing, so not sure if it really worked, but at least it ran without errors.

A few ideas how to make it work for you:

1. You gave a link to a PDF, but you did not describe how you provided the content of the PDF to the model. It might only have read the text with something like pdftotext, which for this PDF results in a garbled mess. It is safer to convert the pages to PNG (e.g. with pdftoppm) and let the model read it from the pages. A prompt like "Transcribe these pages as markdown." should be sufficient. If you can not see what the model did, there is a chance it made things up.

2. You used C++, but Python is much easier to write. You can tell the model to translate the code to C++ once it works in Python.

3. Tell the model to write unit tests to verify that the individual components work as intended.

4. Use Agent Mode and tell the model to print something and to judge whether the output is sensible, so it can debug the code.


Interesting. Thanks for the suggestions.


> I do wonder if there are any DOS vectors that need to be considered if such a large image can be defined in relatively small byte space.

You can already DOS with SVG images. Usually, the browser tab crashes before worse things happen. Most sites therefore do not allow SVG uploads, except GitHub for some reason.


svg is also just kind of annoying to deal with, because the image may or may not even have a size, and if it does, it can be specified in a bunch of different units, so it's a lot harder to get this if you want to store the size of the image or use it anywhere in your code


Here is an even older comment chain about it from 2020: https://news.ycombinator.com/item?id=23895706

Apparently, comparing low-background steel to pre-LLM text is a rather obvious analogy.


As well as that people often do think alike.

If you have a thought, it's likely it's not new.


Oh wow, great find! That’s really early days.


Could you explain a bit how the code works? For example, how does it detect the correct pixel size and how does it find out how to color the (potentially misaligned) pixels?


Consider applying for YC's Summer 2026 batch! Applications are open till May 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search:

HN For You