Love the format, and super cool to see a benchmark that so clearly shows DRAM refresh stalls, especially avoiding them via reverse engineering the channel layout! Ran it on my 9950X3D machine with dual-channel DDR5 and saw clear spikes from 70ns to 330ns every 15us or so.
The hedging technique is a cool demo too, but I’m not sure it’s practical.
At a high level it’s a bit contradictory; trying to reduce the tail latency of cold reads by doubling the cache footprint makes every other read even colder.
I understand the premise is “data larger than cache” given the clflush, but even then you’re spending 2x the memory bandwidth and cache pressure to shave ~250ns off spikes that only happen once every 15us. There’s just not a realistic scenario where that helps.
HFT especially is significantly more complex than a huge lookup table in DRAM. In the time you spend doing a handful of 70ns DRAM reads, your competitor has done hundreds of reads from cache and a bunch of math. It's just far better to work with what you can fit in cache, and to shrink what doesn't as much as possible.
Another point about HFT - They're mostly using FPGAs (some use custom silicon) which means that they have much tighter control over how DRAM is accessed and how the memory controller is configured. They could implement this in hardware if they really need to, but it wouldn't be at the OS level.
> At a high level it’s a bit contradictory; trying to reduce the tail latency of cold reads by doubling the cache footprint makes every other read even colder.
That’s my main hang-up as well. On one hand this is undeniably cool work, but on the other, efficient cache usage is how you maximize throughput.
This optimizes for (narrow) tail latency, but I do wonder at what performance cost. I would be super interested in hearing about real world use cases.
This might be useful in a case where a small lookup table or similar is often pushed out of cache, such that lookups are usually cold, yet the lookup data is small enough not to cause issues with cache pollution, increased bandwidth, or memory consumption.
Isn't that rather trivial though as a source of tail latency? There are much worse spikes coming from other sources, e.g. power management states within the CPU and possibly other hardware. At the end of the day, this is why simple microcontrollers are still preferred for hard RT workloads. This work doesn't change that in any way.
Yeah exactly, and it’s absolutely dwarfed by the tail latency of going to DRAM in the first place. A cache miss is a 100x tail event vs. an L1 hit. The refresh stall is a further 5x on top of that, which barely registers if you’re already eating the DRAM cost.
It could be massively improved with a special CPU instruction for racing DRAM reads. That might make it actually useful for real applications. As it is, the threading model she used here would make it incredibly difficult to use in a real program.
There’s no point racing DRAM reads explicitly. Refreshes are infrequent and the penalty is like 5x on an already fast operation, 1% of the time.
What’s better is to “race” against cache, which is 100x faster than DRAM. CPUs already do this for independent loads via out-of-order execution. While one load is stalled waiting for DRAM, another can hit the cache and do some compute in parallel. It’s all already handled at the microarchitectural level.
There are already systems that do this in hardware. Any system that has memory mirroring RAS features can do this, notably IBM zEnterprise hardware, you know, the company this video's author claims to be one-upping.
The memory controller sends the read to the DIMM that is not refreshing. It is invisible to software, except for the side-effect of having better performance.
It is not only not practical, it is a completely useless technique. I got downvoted to negative infinity for mentioning this, but I guess I am the only person who actually read the benchmark. The reason the technique "works" in the benchmark is that all the threads run free and just record their timestamps. The winner is decided post hoc. This behavior is utterly pointless for real systems. In a real system you need to decide the winner online, which means the winner needs to signal somehow that it has won, and suppress the side effects of the losers, a multi-core coordination problem that wipes out most of the benefit of the tail improvement but, more importantly, also massively worsens the median latency.
You got downvoted for being an asshole, and if you continue to be an asshole on HN we are going to ban you. I suppose you don't believe this because we haven't done it yet even after countless warnings:
The reason we haven't banned you yet is because you obviously know a lot of things that are of interest to the community. That's good. But the damage you cause here by routinely poisoning the threads exceeds the goodness that you add by sharing information. This is not going to last, so if you want not to be banned on HN, please fix it.
How so? Being precise and correct is IMO worth preserving in a world of handwaving slop.
The industrial revolution was from ~1760–1840. It was a major shift, but it doesn’t cover everything that happened between 1760 and now, nor did it overwhelm many existing trends.
Based on the recent leaks, their system prompt explicitly nudges the model not to do anything outside of what was asked. That could very well explain why it’s not fixing preexisting broken tests.
“Don't add features, refactor code, or make "improvements" beyond what was asked.”
And it's very valid. Because otherwise you would ask Claude to trim a tree and it would go raze the whole forest and plant new seeds. This was the primary pain point last year, especially with Sonnet.
I want to play a game. In your hands is a chainsaw about to be destructed. Another exception is already in flight. Live, or std::terminate. Make your choice. -Jigsaw
So... a vibe slop index to keep track of all the vibe slop apps?
The cherry on top: it’s completely broken! Enable the Context Awareness filter, the list shrinks. Now enable the Auto-pasting filter, the list grows back.
I wouldn't call it completely broken; pressing buttons still does something. It looks like an OR filter instead of an AND. It should be updated to an AND filter, as that's more intuitive.
If you squint, it looks kinda maybe superficially useful? But if you actually critically look at it, it makes no sense.
The categories are clearly LLM generated from the GhostPepper codebase, with vague low level descriptions and links to code. Most categories apply to every listed project.
The UI is the same tiny bit of LLM generated information displayed five different confusing ways. Like seriously, click on a project and you first see a bunch of haphazard feature cards, then a bunch of “feature ... active” rows. Looks fancy, but actually just noise. Textbook slop.
Better would be a simple awesome-style markdown page, with a feature matrix having categories and descriptions curated by a human that actually understands and cares about the domain.
Sorry if this is harsh, but passing off LLM output as “curation” is particularly insulting to me.
There is some top class wizardry going on there! I don’t think I’ve ever used conditions in a type definition in C++ :)
Update:
Ah, alright - so that evaluation logic is part of the template, not the code that eventually compiles.
It’s basically offloading some of the higher-level language compiler logic to the templating engine. Honestly, it might be a better time investment than writing that logic in the parser.
Now I’m sort of intrigued and inspired to use C++ as a lowering target for elevate (a compiler framework I’ve been working on).
Do you really disagree that it’s advancing science? Surely actually testing hardware, building knowledge on how to run this type of mission, learning to use lunar resources, figuring out how to keep people alive, etc. will teach us things we couldn’t learn any other way.
Fwiw I do share your concerns about the methods (sending humans on this specific mission is questionable, and SLS is questionable compared to the SpaceX approach).
Do you think we will learn more from Artemis or the Asteroid Redirect Mission? Because that's a concrete example of how funding this mission caused other experiments to be cancelled.
Fair point, but that’s an argument about prioritization within NASA’s budget (and its size relative to other spending), not the scientific value of the mission.
There's always non-zero value in solving any challenging engineering problem. The question is whether the finite resources spent to solve it are best spent on it versus other projects.
And in this mission in particular, you can't divorce science from politics. NASA's budget was reined in by Trump 45 and his admin picked Artemis because a manned mission to the moon invokes a particular feeling and memory, not because it benefits science. The moon is a known quantity, and going there is not more valuable than the other projects the government could have spent $100 billion on.
Keep in mind, this is one of the most expensive single launches in history while there is a partial government shutdown and the rest of the federal government that does real research has been gutted by this same administration. So it's tough to talk about "scientific value" when it's obvious that this mission is doing little science at the same time the government has decreed it won't be in the business of paying for science.
The moon isn’t a known quantity, we sent a handful of people there for a combined few days half a century ago. There’s immense scientific and engineering value in keeping a generation of engineers fluent in deep space operations.
If you’re angry about this dumpster fire of an administration wasting money and gutting research (I am too), the answer is to fight for better funding across the board, not to tear down one of the few ambitious programs left that’s actually pushing the boundaries on what we can do. NASA’s budget amounts to a rounding error and isn’t zero sum with the rest of federal science funding, these are separate appropriations.
I'm going to tear down spending $2.5 billion to test the toilets on a space ship every chance I get. It is a massive waste of resources and depletion of human capital that would be better spent on other projects that could advance science and human understanding.
It's not science, it's engineering. I don't think it's advancing science in a way that wouldn't be possible with a fraction of the cost without sending humans there.