Hacker News | anematode's comments

It's not quite the same as XOR swap, but a trick I've found handy is conditionally swapping values using XOR:

    int a, b;
    bool cond;
    
    int swap = cond ? a ^ b : 0;
    a ^= swap;
    b ^= swap;
If cond is highly unpredictable this can work rather nicely.

Electricity is heavily regulated. Is there any evidence that LLMs will be the same?

Was electricity regulated in the first decade of its existence?

I don't know but likely not. Factories were powered by steam then, and had a "power plant" on site. So they didn't convert to electricity until it was reliable and guaranteed.

Was anything regulated in those times? You could legally buy humans at that time.

But that doesn't mean we live with the same standards. The lack of regulation around electricity led to a lot of deaths and disasters, which is why it was regulated.

But we don't live at the start of the 20th century; we live in 2026, and we must learn from the past instead of being hellbent on repeating it.


Cool investigation. This part perplexes me, though:

> Games have apparently been using split locks for quite a while, and have not created issues even on AMD’s Zen 2 and Zen 5.

For the life of me I don't understand why you'd ever want to do an atomic operation that's not naturally aligned, let alone one split across cache lines....


> For the life of me I don't understand why you'd ever want to do an atomic operation that's not naturally aligned, let alone one split across cache lines....

I assume they force-packed their structures and they're poorly aligned. x86 doesn't fault on unaligned access, and Windows doesn't detect and punish split locks, so while you'd probably get better performance with proper alignment, it might not be a meaningful improvement on the majority of the machines running the program.


Ah, that's a great hypothesis. I wonder, then, how it works with x86 emulation on ARM. IIRC, atomic ops on ARM fault if the address isn't naturally aligned... but I guess the runtime could intercept that and handle it slowly.

ARM macs apparently have some kind of specific handling in place for this when a process is running with x86_64 compatibility, but it’s not publicly documented anywhere that I can see.

XNU has this oddity: https://github.com/apple-oss-distributions/xnu/blob/f6217f89...

Redacted from open source XNU, but exists in the closed source version


Is it actually redacted, or just a leftover stub from a feature implemented in silicon instead of software? Isn't the x86 memory order compatibility done at hardware level?

Redacted

An emulated x86 atomic instruction wouldn’t need to use atomic instructions on ARM.

Why not?

They don’t have to match.

As an example, consider a divide instruction. A machine without an FPU can emulate one that has it. It will legitimately have to run hundreds or thousands of instructions to emulate a single divide, and it will certainly take longer.

That's OK; it just means the emulation is slower doing that than something like add, which the host has a native instruction for. In 'emulator time' you still only ran one instruction. That world is still consistent.


? That's not how Windows on ARM emulation works. It uses dynamic JIT translation from x86 to ARM. When the compiler sees, e.g., lock add [mem], reg presumably it'll emit a ldadd, but that will have different semantics if the operand is misaligned.

You mean the locking would be done in software?

They don't do it on purpose.

It's just really easy to do accidentally with custom allocators, and games tend to use custom allocators for performance reasons.

The system malloc will return pointers aligned to the size of the largest atomic operation by default (16 bytes on x86-64), and compilers depend on this automatic alignment for correctness. But it's real easy for a custom allocator to use a smaller alignment. Maybe the author didn't know, maybe they assumed they would never need the full 16-byte atomics. Maybe the 16-byte atomics weren't added until well after the custom allocator.


Packing structures can improve performance and overall memory usage by reducing cache misses.

Unlikely – no one is starting off undecided, then reading one article in The New Yorker and then committing this. And it's a slippery slope to tie it to legitimate criticism.

As far as I know, the US is the only country like this. But anti-AI sentiment is rising around the world.

Abysmal response.

Dear lord. It's actually laggy for me to scroll on that page.

Same here, and I'm using a beefy MacBook (Apple M4 Max, 64 GB RAM). Something is wrong with the front-end code. There are a lot of animations, so my hunch would be that something goes wrong there.

Moore said computers get twice as fast every 18 months. Web devs took that as a challenge.

He said the transistor count on a chip doubles. (The more accurate pithy comment would be that they took it as available resources.)

🤓

This was already posted: https://news.ycombinator.com/item?id=47047936

It contains many factual errors.


Hey, author here. I have actually rewritten the parts you might call AI slop. I have tried to correct the text to the best of my abilities.

https://bgslabs.org/blog/evolution-of-x86-simd/#acknowledgem...

If you look there, I tried to be as transparent about this process as possible. I simply didn't know any better than to use AI to fact-check my data when I first started, which was a really bad idea and led to the horrendous outcome you've seen there. I am not trying to hide anything; I made a mistake. If you could give the article a re-read and tell me where I might have gone wrong, I would be really happy. I actually want this to be a good and useful educational resource, not AI slop.

Thank you for your time regardless.


Quoting from the README:

> The entire VSCode workbench - editor, terminal, extensions, themes, keybindings — ported to run on a native shell.

but also

> Many workbench features are stubbed or partially implemented

So which is it? The README needs to be clearer about what is aspirational, what is done, and what is out of scope. Right now it just looks like an LLM soup.


The first sentence is aspirational, while the second is describing current state.

See I thought that, but then in the putatively aspirational section it says

> 5,600+ TypeScript files from VSCode's source, ported and adapted

which doesn't really make sense as a goal?


Hey, creator of Sidex here. I just updated the README to be a lot clearer; I would love your feedback on it to help clear the air.

Nice post :)

Last year I was working on a tail-call interpreter (https://github.com/anematode/b-jvm/blob/main/vm/interpreter2...) and found a similar regression on WASM when transforming it from a switch-dispatch loop to tail calls. SpiderMonkey did the best with almost no regression, while V8 and JSC totally crapped out – same finding as the blog post. Because I was targeting both native and WASM I wrote a convoluted macro system that would do a switch-dispatch on WASM and tail calls on native.

Ultimately, because V8's register allocation couldn't handle the switch-loop and was spilling everything, I basically manually outlined all the bytecodes whose implementations were too bloated. But V8 would still inline those implementations and shoot itself in the foot, so I wrote a wasm-opt pass to indirect them through a __funcref table, which prevented inlining.

One trick, to get a little more perf out of the WASM tail-call version, is to use a typed __funcref table. This was really horrible to set up and I actually had to write a wasm-opt pass for this, but basically, if you just naively do a tail call of a "function pointer" (which in WASM is usually an index into some global table), the VM has to check for the validity of the pointer as well as a matching signature. With a __funcref table you can guarantee that the function is valid, avoiding all these annoying checks.


The article shows WASM being 1.2-3.7x slower, and your experience confirms it.

Do you have any idea which operations regress the most?


Based on looking at V8's JITed code, there seemed to be a lot of overhead with stack overflow checking, actually. The function prologues and epilogues were just as bloated in the tail-call case. I'll upload some screenshots if I can find them.
