Just wanted to mention genetic algorithms (GAs), pioneered by John Holland and popularized (in the form of genetic programming) by John Koza and others.
The post uses a 4-instruction program as an example, with about 256^4, or roughly 4.3 billion, combinations. Most interesting programs are 10, 100, or 1000+ instructions long, which is far too large a search space to explore by brute force.
So GAs use a number of tricks to investigate the search space via hill climbing without getting stuck at local optima. They do that by treating the search space as a bit string, then randomly flipping bits (mutation) or exchanging substrings between two parents (crossover, the analogue of sexual reproduction) to hop to related hills in the search space. Then the bit string is converted back to instructions and tested to see whether it performs the desired algorithm.
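A minimal sketch of those operators (illustrative only; the target, rates, and selection scheme are arbitrary choices — in a real program search, fitness() would decode the bits into instructions and score the program's behavior):

```python
import random

# Illustrative-only GA sketch: evolve a bit string toward a known target
# using the mutation and crossover operators described above. All
# parameters (population size, mutation rate, etc.) are arbitrary.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]

def fitness(bits):
    return sum(b == t for b, t in zip(bits, TARGET))  # matching positions

def mutate(bits, rate=0.05):
    return [b ^ 1 if random.random() < rate else b for b in bits]  # bit flips

def crossover(a, b):
    point = random.randrange(1, len(a))  # single-point: swap tails
    return a[:point] + b[point:]

def evolve(pop_size=50, generations=200, seed=0):
    random.seed(seed)
    pop = [[random.randint(0, 1) for _ in TARGET] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        if fitness(pop[0]) == len(TARGET):
            break
        parents = pop[:pop_size // 2]  # truncation selection keeps the top half
        pop = parents + [
            mutate(crossover(random.choice(parents), random.choice(parents)))
            for _ in range(pop_size - len(parents))
        ]
    return max(pop, key=fitness)

best = evolve()
```

Because the top half survives each generation, the best fitness never decreases; mutation and crossover do the hill hopping.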
The bit string usually encodes the tree form of a Lisp program to minimize syntax. We can think of it as if every token is encoded in bits (like Huffman coding, which was inspired by Morse code). For example, the tokens in a (+ 1 2) expression might have the encodings 00, 01 and 10, so the bit string would be 000110. With three 2-bit slots there are 3^3 = 27 decodable token combinations to explore (2^6 = 64 raw bit patterns if we naively flip bits, since some patterns don't decode to any token).
Note that many of the decoded strings, like (+ + 1) or (2 1 +), don't run. So guard rails can be added to reduce the search space, for example by bailing out early when a candidate throws a compiler exception, or by using SAT solvers or caching to weed out nonviable bit strings.
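A toy version of the fixed-width encoding and guard-rail filtering described above (the token set and 2-bit codes are invented for illustration):

```python
from itertools import product

# Toy version of the fixed-width token encoding and guard rails described
# above. The token set and 2-bit codes are invented; "11" is deliberately
# left unassigned so some raw bit patterns fail to decode at all.
ENCODING = {"+": "00", "1": "01", "2": "10"}
DECODING = {v: k for k, v in ENCODING.items()}

def encode(tokens):
    return "".join(ENCODING[t] for t in tokens)

def decode(bits):
    codes = [bits[i:i + 2] for i in range(0, len(bits), 2)]
    if any(c not in DECODING for c in codes):
        return None  # hit an unassigned bit pattern
    return [DECODING[c] for c in codes]

def is_wellformed(tokens):
    # Guard rail: prefix form, operator first, then two number tokens.
    return (tokens is not None and len(tokens) == 3
            and tokens[0] == "+" and all(t in ("1", "2") for t in tokens[1:]))

assert encode(["+", "1", "2"]) == "000110"

# Of the 2^6 = 64 raw bit strings, only 27 decode and only 4 are well-formed.
raw = ["".join(bits) for bits in product("01", repeat=6)]
decodable = [decode(b) for b in raw]
viable = [t for t in decodable if is_wellformed(t)]
```

Filtering before evaluation is the cheap version of the guard rails mentioned above; caching or SAT-style pruning would go where is_wellformed() sits.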
We could build a superoptimizer with GAs, then transpile between MOS 6502 assembly and Lisp (or even run the MOS 6502 assembly directly in a sandbox) and not have to know anything about how the processor works. To me, this is the real beauty of GAs, because they allow us to solve problems without training, at the cost of efficiency.
I don't think that LLMs transpile to Lisp when they're designing algorithms. So it's interesting that they can achieve high complexity and high efficiency via training, without even having verification built in. Although LLMs with billions or trillions of parameters, trained on trillions of tokens and running on teraflops GPUs with many GB of memory, may or may not be viewed as "efficient".
I suspect that someday GAs may be combined with backpropagation to drastically reduce learning time by finding close approximations to the matrix math of gradient descent. GAs were just starting to be used to pseudorandomly produce the initial weights of neural nets around 2000, when I first learned about them.
Also quantum computing (QC) could perform certain matrix math in a fraction of the time, or even preemptively filter out bit strings which aren't runnable. I suspect that AI will get an efficiency boost around 2030 when QC goes mainstream. Which will probably lead us to a final candidate learning algorithm that explains how quantum uncertainty and emergent behavior allow a physical mind to tune into consciousness and feel self-aware, but I digress.
Because modern compilers don't do any of this, and we aren't accustomed to multicore computing, from a sheer number-of-transistors perspective we're only getting a tiny fraction of the computing power that we might otherwise have if we designed chips from scratch using modern techniques. This is why I often say that computers today run thousands of times slower than they should for their transistor budgets.
Andreessen's criticism of introspection, and Musk's criticism of empathy, are projections of their fear of being disconnected from spirit (primarily the notion that we're all one).
Some of us eventually find ourselves in situations that defy logical explanation. I've witnessed my own thoughts and plans rippling out into the world and causing external events to unfold. To the point that now, I'm not sure that someone could present evidence to me to prove that our inner and outer worlds aren't connected. It's almost as hard of a problem as science trying to solve how consciousness works, which is why it has nothing to say about it and leaves it to theologians.
The closest metaphysical explanation I have found is that consciousness exists as a field that transcends 4D spacetime, so our thoughts shift our awareness into the physical reality of the multiverse that supports their existence. Where a 4D reality is deterministic without free will, a 5D reality is stochastic and may only exist because of free will. And this happens for everyone at all times, so our individuality can be thought of as drops condensed out of the same ocean of consciousness: one spirit fragmented into countless vantage points to subjectively experience reality in separation, so as to not be alone.
Meaning that one soul hoarding wealth likely increases its own suffering in its next life.
That realization is at odds with things like Western religion and capitalism, so the wealthy reject it to protect their egos, without knowing (or while denying) that ego death can be a crucial part of the ascension process.
My great frustration with this is the power imbalance.
Most of us spend the entirety of our lives treading water, sacrificing some part of our prosperity for others. We have trouble stepping back from that and accepting the level of risk and/or ruthlessness required to take from others to give to ourselves. We lose financially due to our own altruism, or more accurately, due to people acting amorally who take advantage of that altruism.
Meanwhile those people win financially and pull up the ladder behind them. They have countless ways, means and opportunities to reduce suffering for others, but choose not to.
The embrace or rejection of altruism shouldn't be what determines financial security, but that's the reality we find ourselves in. Nobility becomes its opposite.
That's what concepts like taxing the rich are about. In late-stage capitalism, a small number of financial elites eventually rig the game so that others can't win, or arguably even play.
It's the economic expression of the paradox of tolerance.
So the question is, how much more of this are we willing to tolerate before the elites reach the endgame and see the world burn?
Note that we had the technology to do this affordably as of about 2008, when lithium iron phosphate (LiFePO4) batteries became widely available for about $10-12 each (I had to look that up). They were definitely available at low cost ($6) by 2018:
Looks like sodium-ion (Na-ion) 18650 batteries at 1.5 Ah have about half the capacity of LiFePO4 18650s at 3.5 Ah, and are about twice the price, so let's call them roughly 4x the price per unit of energy stored:
So we can project that Na-ion batteries will have the same price per kWh as today's LiFePO4 in perhaps 8 years, or around 2034, if not sooner. That will negate the lithium supply chain bottleneck so that we're limited to ordinary shortages (like copper).
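Running the rough numbers above (cell prices and the shared nominal voltage are assumptions, not quotes):

```python
# Back-of-the-envelope check of the Na-ion vs LiFePO4 comparison above.
# Cell prices and the shared nominal voltage are assumptions, not quotes.
NOMINAL_V = 3.2  # both chemistries sit near ~3.0-3.2 V nominal

lifepo4_wh = 3.5 * NOMINAL_V   # 18650 @ 3.5 Ah -> ~11.2 Wh
naion_wh = 1.5 * NOMINAL_V     # 18650 @ 1.5 Ah -> ~4.8 Wh

lifepo4_price, naion_price = 6.0, 12.0  # assumed $/cell ("twice the price")

lifepo4_per_kwh = lifepo4_price / lifepo4_wh * 1000
naion_per_kwh = naion_price / naion_wh * 1000
ratio = naion_per_kwh / lifepo4_per_kwh
```

The exact ratio works out closer to 4.7x, so "call it 4x" is the right ballpark.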
500 W bifacial solar panels are available for $100 each in bulk, so there's no need to analyze them; they're no longer the bottleneck. A typical home uses 24 kWh/day, so 15-20 panels at a typical 4.5 peak sun hours per day (about 4.5 kWh/m2/day of insolation) provide enough power to charge batteries and still have some energy left over, at a cost of $1500-2000. Installation labor, electricians/licensing, inverters and batteries now dominate cost.
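A quick sanity check on that sizing (the derate factor is my assumption):

```python
# Quick sanity check on the solar sizing above. "Peak sun hours" is the
# standard shorthand for daily insolation in kWh/m^2; panel output scales
# with it at the rated wattage.
PANEL_W = 500               # rated watts per panel
PEAK_SUN_HOURS = 4.5        # typical daily insolation, kWh/m^2/day
HOME_KWH_PER_DAY = 24

def daily_yield_kwh(panels, derate=0.85):
    # derate covers inverter losses, temperature, soiling, etc. (assumed)
    return panels * PANEL_W / 1000 * PEAK_SUN_HOURS * derate

low, high = daily_yield_kwh(15), daily_yield_kwh(20)
surplus = low - HOME_KWH_PER_DAY  # headroom left for charging batteries
```

Even the 15-panel low end clears the 24 kWh/day budget with a few kWh to spare.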
The sodium ion battery market is about $1 billion annually, vs $100 billion for lithium ion. It took lithium about 15-20 years to grow that much. So whoever gets in now could see a 1-2 orders of magnitude return over perhaps 8-15 years. I almost can't think of a better investment outside of AI.
-
I've been watching this stuff since the 1980s and I can tell you that every renewable energy breakthrough coincides with geopolitical instability. For the $8 trillion the US spent on Middle East wars since 9/11, we could have had a moonshot for solar+batteries and be at 90+% coverage today. That's not counting the other $12 trillion the US spent on the Cold War. Fully $20 trillion of our ~$40 trillion national debt went to funding endless war, with the other $20 trillion lost on trickle-down tax cuts for the ultra wealthy.
We can't do anything about that stuff in the short term. But we can move towards off-grid living and a distributed means of production model where AI, 3D printing, permaculture, and other alternative tech negate the need for investment capital.
In the K-shaped economy, the "if you can't beat 'em, join 'em" phrase might more accurately be stated "if you can't join 'em, beat 'em".
forkrun is one of a vanishingly small number of projects written since the 1990s that get real work done as far as multicore computing goes.
I'm not super-familiar with NUMA, but hopefully its concepts are applicable to other architectures. I noticed that you mentioned things like atomic add in the readme, which gives me confidence that you really understand this stuff at a deep level.
My use case might eventually be to write a self-parallelizing programming language where higher-order methods run as isolated processes. Everything would be const by default, so imperative-looking code could run in a functional runtime. Then the compiler could turn loops and conditionals into higher-order methods, since there are no side effects. Any mutability would be provided by monads enforcing the functional core, imperative shell pattern, so that we could track state changes and enumerate all exceptional cases.
Basically we could write JavaScript/C-style code having MATLAB-style matrix operators that runs thousands of times faster than current languages, without the friction/limitations of shaders or the cognitive overhead of OpenCL/CUDA.
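A tiny sketch of that "functional core, imperative shell" split with monadic-style state tracking (all names are invented for illustration):

```python
from dataclasses import dataclass, replace

# Illustrative sketch (all names invented) of the "functional core,
# imperative shell" split described above: the core is pure functions over
# immutable values, and the only mutation lives in a thin shell that logs
# every state transition, so changes are trackable and testable.
@dataclass(frozen=True)  # frozen = const by default
class Account:
    owner: str
    balance: int

def deposit(acct: Account, amount: int) -> Account:
    if amount <= 0:
        raise ValueError("deposit must be positive")  # enumerable failure case
    return replace(acct, balance=acct.balance + amount)  # returns a new value

class Shell:
    """Imperative shell: owns the single mutable reference, records history."""
    def __init__(self, state: Account):
        self.state = state
        self.history = [state]

    def apply(self, fn, *args):
        self.state = fn(self.state, *args)  # the only reassignment in the system
        self.history.append(self.state)
        return self.state

shell = Shell(Account("ada", 100))
shell.apply(deposit, 50)
```

Because the core never mutates anything, every intermediate state survives in the history list, which is what makes exceptional cases enumerable.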
-
I feel that pretty much all modern computer architectures are designed incorrectly, which I've ranted about countless times on HN. The issue is that real workloads mostly wait for memory, since the CPU can run hundreds of times faster than load/store, especially on cache and branch-prediction misses. So fabs invested billions of dollars into caches and branch prediction (that was the incorrect part).
They should have invested in multicore with local memories acting together as a content-addressable memory. Then fork with copy-on-write would have provided parallelism for free.
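The fork-plus-copy-on-write idea can be demonstrated on POSIX systems today; a rough sketch (note that CPython's refcounting dirties pages, so the sharing is imperfect in practice):

```python
import os

# Sketch of the "fork with copy-on-write gives parallelism for free" idea:
# after fork(), the child shares the parent's pages until either side writes
# to them, so a large read-only dataset costs (almost) nothing to hand to a
# worker. Results come back over a pipe. POSIX-only; CPython refcounts will
# dirty some pages, so the sharing is imperfect in practice.
data = list(range(1_000_000))  # built once, shared copy-on-write

def parallel_sum(chunks):
    pipes = []
    for lo, hi in chunks:
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:  # child: sees `data` without copying it
            os.close(r)
            os.write(w, sum(data[lo:hi]).to_bytes(8, "big"))
            os._exit(0)
        os.close(w)
        pipes.append((pid, r))
    total = 0
    for pid, r in pipes:
        total += int.from_bytes(os.read(r, 8), "big")
        os.close(r)
        os.waitpid(pid, 0)
    return total

n = len(data)
result = parallel_sum([(0, n // 2), (n // 2, n)])
```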
Instead, CPU progress (and arguably Moore's law itself) ended around 2007 with the arrival of the iPhone and Android, which sent R&D money to low-cost and low-power embedded chips. So the world was forced to jump on the GPU bandwagon, doubling down endlessly on SIMD instead of giving us MIMD.
Leaving us with what we have today: a dumpster fire of incompatible paradigms like OpenGL, Direct3D, Vulkan, Metal, TPUs, etc.
When we could have had transputers with unlimited compute and memory, scaling linearly with cost, that could run 3D and AI libraries as abstraction layers. Sadly that's only available in cloud computing currently.
We just got lucky that neural nets can run on GPUs. It would have been better to have access to the dozen or so other machine learning algorithms, especially genetic algorithms (which run poorly on GPUs).
forkrun's NUMA approach is largely based on the idea that, as you said, "real workloads mostly wait for memory". The waiting gets worse on NUMA because accessing memory from a different chiplet or a different socket means accessing data that is physically farther from the CPU, and thus has higher latency. forkrun takes a somewhat unique approach to dealing with this: instead of taking data in, putting it somewhere, and reshuffling it around based on demand, forkrun puts it on the correct NUMA node's memory immediately as it comes in. This creates a NUMA-striped global data memfd. On NUMA systems, forkrun duplicates most of its machinery (indexer + scanner + worker pool) per node, and each node's machinery is only offered chunks from the global data memfd that are already in node-local memory.
This directly aims to solve (or at least reduce the effect from) "CPUs waiting for memory" on NUMA systems, where the wait (if memory has to cross sockets) can be substantial.
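The dispatch rule can be sketched abstractly (this is a toy simulation of the idea, not forkrun's actual code):

```python
from collections import defaultdict

# Toy simulation of the dispatch rule described above: incoming chunks are
# striped across NUMA nodes as they arrive, and each node's worker pool is
# only ever offered chunks that already live in its local memory, so no
# work item forces a cross-socket fetch.
NUM_NODES = 2

def stripe(chunks):
    """Place chunk i on node i % NUM_NODES at ingest time."""
    placement = defaultdict(list)
    for i, chunk in enumerate(chunks):
        placement[i % NUM_NODES].append(chunk)
    return placement

def dispatch(placement, worker_node):
    # A worker on `worker_node` only sees node-local chunks.
    return placement[worker_node]

placement = stripe([f"chunk{i}" for i in range(8)])
local_work = dispatch(placement, 0)
```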
I don't know why you or your parent commenter got downvoted, but I use that as evidence that the end is very near.
With the current geopolitical climate and the arrival of AI, I'm predicting a sharp economic downturn at the end of the year the likes of which we haven't seen in a century.
I mean the Housing Bubble popping and the Dot Bomb were bad, but the US national debt was so much lower then. Income inequality was lower. Student loan debt was lower. Healthcare was more affordable. Homes were more affordable. Food was more affordable. We had (some) faith in our electoral process.
When the cheap capital runs out, when value of the dollar collapses due to unforced error, when the overseas investment dries up, when billionaires panic and yank their investment in AI (leaving us with a duopoly like always), when the employment rate peaks never to return, when companies stop hiring for the foreseeable future, when people stop visiting websites or buying software, when we abandon liberal arts for the trades in Service Economy 2.0, when hospitals and universities close, when farms go bankrupt, when interest on the US national debt consumes its social safety net, when we sell our public lands for pennies on the dollar, when nobody is held accountable...
That's when we the people will remember who we are. Somehow, like every other time before, we'll pull ourselves up by our bootstraps from nothing. Without time, money or resources, we'll come together and find a way to rebuild. We won't even tax the rich or incite violence against them, we'll simply manifest the abundant reality that's been denied to us by them for so long.
That looks like organizing. Unions. Cooperatives. Mutual aid networks. Renewable energy. Permaculture. Voluntary employment and clock-in. Credit unions and crowdfunding. Automation. Distributed means of production. Fair trade. Class action lawsuits. Boycotts. Voting against incumbents. Solarpunk.
We'll transcend competition and see the matrix for the bill of goods that it is. Rather than trying to get the money and power back in futility, we'll make them irrelevant.
It's time to start thinking about selling those stocks. Divesting from the blood money of unearned income that comes from exploitation, suffering and war (even though they don't tell us that). Steering clear of prediction markets. Dropping the crypto.
We know they won't. But that's why they'll stay insulated from knowing what stuff they're made of, holding out as long as possible, lonely and alone. And the fun part is, they'll get to find out anyway when the music stops.
I remember having this debate back in the late 1990s when I was in college for my electrical and computer engineering (ECE) degree. At the time, as students, we didn't really know about nuances like delta cycles, so preferring Verilog or VHDL came down to a matter of personal taste.
Knowing what I know now, I'm glad that they taught us VHDL. That's also one of the reasons it's worth trying to get into the best college you can: as long as you're learning stuff, you might as well learn the most rigorous way of doing it.
---
It's these sorts of nuances that make me skeptical of casual languages like Ruby and even PHP (my favorite despite its countless warts). I wish that we had this level of insight back during the PHP 4 to 5 transition, because so many easily avoidable mistakes were made in a design-by-committee fashion.
For example, PHP classes don't use copy-on-write like arrays do, so we missed out on avoiding a whole host of footguns, as well as on being able to use [] or -> interchangeably like in JavaScript. While we're at it, the "." operator for string concatenation was a tragic choice (they should have used & or .. IMHO), because then "." could have been the object operator instead of -> (borrowed from C++), but I digress.
I often dream of writing a new language someday at the intersection of all of these lessons learned, so that we could write imperative-looking code that runs in a functional runtime. It would mostly encourage stringing together higher-order methods, but have a smart enough optimizer to handle loops and conditional logic by converting them to higher-order methods internally (since pure code has no side effects). Basically the intermediate code (i-code) would be a tree representation in the same form as Lisp or a spreadsheet, which could be transpiled to all of these other languages, but with special treatment of mutability (monadic behavior). The code would be pure-functional but suspend to read/write outside state, in order to enforce the functional core, imperative shell pattern.
A language like that might let us write business logic that's automatically parallelized and could be synthesized in hardware unmodified. It would tend to execute many thousands of times faster than anything today on modifiable hardware like an FPGA. I'd actually prefer to run it on a transputer, but those fell out of fashion decades ago after monopoly forces took over.
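The loop-to-higher-order-method rewrite can be shown by hand; both forms below are equivalent because the loop body is pure:

```python
from functools import reduce

# Sketch of the loop-to-higher-order rewrite described above. Because the
# loop body has no side effects, a compiler could mechanically turn the
# imperative form into a filter -> map -> fold pipeline; here both forms
# are written by hand to show they compute the same thing.
def imperative(xs):
    total = 0
    for x in xs:
        if x % 2 == 0:      # conditional ...
            total += x * x  # ... and accumulation inside a loop
    return total

def functional(xs):
    # What the optimizer would emit: each stage is side-effect free and
    # therefore trivially parallelizable.
    return reduce(lambda acc, x: acc + x,
                  map(lambda x: x * x,
                      filter(lambda x: x % 2 == 0, xs)),
                  0)
```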
By extension, are you also an antitrust enforcement denier?
Also by extension, do you understand the term late-stage capitalism?
Because if you truly believe that regulation isn't necessary, then you are either ok with, or unaware that, unregulated capitalism ends in monopoly (or duopoly to keep up appearances). A free market only has a chance of existing under regulation, otherwise it's immediately gamed to maximize profit, which leads to runaway wealth inequality (the antithesis of a free market).
In other words, a €730 ($835) top case replacement is only allowed to exist because your worship of deregulation prevents the very competition that you yearn for.
I don't normally word my comments this strongly, but we seem to have lost our BS detectors since yours is the top comment.
Remember that it's ok to change your mind. So I'm not criticizing you, but the mindset that's allowing fundamental mistakes to not only go unchallenged, but be celebrated.
Lol I think watching the entirety of the EU run on shitty plastic laptops that are government approved would really top the British fight against heatwaves by smearing yoghurt on their windows.
I think I’d really enjoy this. Yes, please do this.
Not to mention that WindowServer seems to take 100+% cpu since the upgrade. Also I can't paste filenames in the save file dialog in some apps. And the URL field in Safari is just weird.
My computer was running so slowly that I had to minimize transparency in system preferences somewhere. I think I also turned off opening every app in its own space. And I hid the icons on the Desktop in Finder settings somehow, which helped a lot. There are countless other little tweaks that are worth investigating.
I also highly recommend App Tamer (no affiliation). It lets you cap background apps at 10% CPU or whatever. It won't help with WindowServer or kernel_task (which also often runs at 100+% CPU), but it's something.
I can't help but feel that there's nobody at the wheel at Apple anymore. When I have to wait multiple seconds to open a window, to switch between apps, to go to my Applications folder, then something is terribly wrong. Computers have been running thousands of times slower than they should be for decades, but now it's reaching the point where daily work is becoming difficult.
I'm cautiously optimistic that AI will let us build full operating systems using other OSs as working examples. Then we can finally boot up with better alternatives that force Apple/Microsoft/Google to try again. I could see Finder or File Explorer alternatives replacing the native ones.
> Computers have been running thousands of times slower than they should be for decades
I've been hearing this complaint for decades and I'll never understand it. The suggestion seems completely at odds with my own experience. Regardless of OS, they all seem extremely fast, and feel faster and faster as time goes on.
I remember a time when I could visually see the screen repaint after minimizing a window, or waiting 3 minutes for the OS to boot, or waiting 30 minutes to install a 600mb video game from local media. My m2 air with 16gb of memory only has to reboot for updates, I haphazardly open 100 browser tabs, run spotify, slack, an IDE, build whatever project I'm working on, and the machine occasionally gets warm. Everything works fine, I never have performance issues. My linux machines, gaming pc, and phone feel just as snappy. It feels to me that we are living in a golden age of computer performance.
I think the best example is in iOS. On old iOS versions, the keyboard responsiveness took precedence over everything, no matter what. If you touched the keyboard, it would respond with an animation indicating what you are doing. The app itself may be frozen, but the self contained keyboard process would continue on, letting you know the app you are using is a buggy mess.
Now in iOS 26, you can just be typing in Notes or just the safari address bar for example, and the keyboard will randomly lag behind and freeze, likely because it is waiting on some autocomplete task to run on the keyboard process itself. And this is on top of the line, modern hardware.
A lot of the fundamentals that were focused on in the past to ensure responsiveness to user input have been lost. And lost for no good reason, other than lazy development practices, unnecessary abstraction layers, and other modern developer conveniences.
Yeah, long ago when I was doing some iOS development, I can remember Apple UX responsiveness mantras like “don’t block the main thread”, as that’s the thing responsible for making app UIs snappy even when something is happening.
Nowadays it seems like half of Apple’s own software blocks on the main thread; like you said, things like the keyboard lock up for no reason. God forbid you try to paste too much text into a Note - the paste will crawl to a halt. Or, on my M4 Max MacBook, 128GB ram, 8tb ssd, Photos library all originals saved locally - I try to cmd-R to rotate an image, and the rotation of a fully local image can sometimes take >10 seconds while showing a blocking UI “Rotating Image…”. It’s insane how low the bar has dropped for Apple software.
My M4 Max 128GB ... 90% of the time is like you say.
10% of the time, Windowserver takes off and spends 150% CPU. Or I develop keystroke lag. Or I can't get a terminal open because Time Machine has the backup volume in the half mounted state.
It's thousands of times faster than the Ultra 1 that was once on my desk. And I can certainly do workloads that fundamentally take thousands of times more cycles. But I usually spend a greater proportion of this machine's speed on the UI, and responsiveness doesn't always beat what I had 30 years ago.
Ok. Today we have multi-Ghz processors, with multiple cores at that.
Photons travel about 1 foot per nanosecond ... so the CPU can execute MANY instructions between the time photons leave your screen and the time they reach your eyes.
Now, on Windows start Word (on a Mac start Writer) ... come on ... I'll wait.
Still with me? Don't blame the SSD and reload it again from the cache.
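For what it's worth, the photon arithmetic works out roughly like this (clock, IPC, core count, and viewing distance are all assumed round numbers):

```python
# Rough numbers for the photon-latency point above (all figures assumed):
# light covers ~1 ft/ns, so at a 2 ft viewing distance the screen-to-eye
# flight time is ~2 ns, during which a modern CPU retires a surprising
# number of instructions.
viewing_distance_ft = 2
flight_time_ns = viewing_distance_ft * 1.0  # ~1 ns per foot

clock_ghz = 4   # cycles per ns
ipc = 4         # instructions per cycle, typical wide core
cores = 8

instructions_in_flight = flight_time_ns * clock_ghz * ipc * cores
```

A few hundred instructions retire before the pixels even reach your retina, which makes multi-second app launches look all the more absurd.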
Not sure what you're getting at. MS Word, full load to ready state after a macOS reboot, takes ~2 seconds on my M1 Mac. If I close and re-open it (so it's in the fs cache) it takes about ~1 second.
You and the sibling comment author just never experienced a truly responsive UI.
It is one where the reaction is under a single frame from the action. EDIT: and a frame is 1/60 s, i.e. ~16.7 ms. I feel bad that I have to mention this basic fact.
This was possible on 1980s hardware. I witnessed that, I used that. Why is it not possible now?
I've used 1980s hardware. In the 80s. And used UNIX and HP/Sun/SGI/etc hardware since the 90s. Not only was it not "truly responsive", nothing opened in a "single frame" (talking about X Windows). It took way longer than 1-2 seconds to open a browser on a blank page, for example, and you could watch many programs slowly drawing themselves.
And I did. And it did. Like, Amiga, even 500 models.
I do not doubt X was horrible from that pov. I remember R5. This is not that I meant.
edit: there were no web browsers back then. the effin "folder browser" opens slower on my xfce4 than the same in an a1200 emulator in a window next to it. this is sad.
Not only does it take a second just to redraw a moved window (with mid-way frames and flashing in between), but opening a tiny program is slow and shows the "zzz" busy indicator.
Base model M4 Mac Mini -- takes 2 seconds to load Word (and ready to type) without it being cached. Less than 1 second if I quit it completely, and launch again, which I assume is because it's cached in RAM.
> Regardless of OS, they all seem extremely fast, and feel faster and faster as time goes on.
This very much depends on what hardware you have and what you're doing on it (how much spare capacity you have).
Back in university I had a Techbite Zin 2, it had a Celeron N3350 and 4 GB of LPDDR4. It was affordable for me as a student (while I also had a PC in the dorm) and the keyboard was great and it worked out nicely for note taking and some web browsing when visiting parents in the countryside.
At the same time, the OS made a world of difference and it was anything but fast. Windows was pretty much unusable and it was the kind of hardware where you started to think whether you really need XFCE or whether LXDE would be enough.
I think both of the statements can be true: that Wirth's law is true and computers run way, way slower than they should due to bad software... and that normally you don't really feel it due to us throwing a lot of hardware at the problem to make us able to ignore it.
It's largely the same as you get with modern video game graphics and engines like UE5, where only now we are seeing horrible performance across the board that mainstream hardware often can't make up for and so devs reach for upscaling and framegen as something they demand you use (e.g. Borderlands 4), instead of just something to use for mobile gaming.
It's also like running ESLint and Prettier on your project and having a full build and formatting iteration take like 2 minutes without cache (though faster with cache), HOWEVER then you install Oxlint and Oxfmt and are surprised to find out that it takes SECONDS for the whole codebase. Maybe the "rewrite it in Rust" folks had a point. Bad code in Rust and similar languages will still run badly, but a fast runtime will make good code fly.
I could also probably compare the old Skype against modern Teams, or probably any split between the pre-Electron and modern day world.
Note: runtime in the loose sense, e.g. compiled native executables, vs the kind that also have GC, vs something like JVM and .NET, vs other interpreters like Python and Ruby and so on. Idk what you'd call it more precisely, execution model?
> Regardless of OS, they all seem extremely fast, and feel faster and faster as time goes on.
The modern throughput is faster by far. However, what some people mean when they talk about "slower" is the latency snappiness that characterizes early microcomputer systems. That has definitely gotten way worse in an empirically measurable fashion.
Dan Luu's article explains this very well [1].
It is difficult to go through that lived experience of low latency today, because you don't appreciate it until you've lived with it for years. Few people have access to an Apple ][ rig with a composite monitor for years on end any longer. The hackers who experienced that low latency never forgot it, because the responsiveness feels like a fluid extension of your thoughts in a way that higher-latency systems can't match.
I wonder if this ties into why I'm baffled at the increasing trend of adding fake delays (f/ex "view transitions"). It's maddening to me. It's generally not a masking/performance delay either; I've recompiled a number of android apps for example to remove these sorts of things, and some actions that took an entire second to complete previously happen instantly after modification.
Ohhhh trust me, I have, assuming you mean "Disable animations". The three duration scale developer settings too. Thank you for suggesting it, though, just in case.
Some apps do respect it, but sometimes it's hardcoded, and OS settings don't seem to override it. Even the OS doesn't respect it in some cases, but I think it used to. Flutter apps? Forget about it.
A really annoying thing I've run into is that lots of libraries/frameworks/etc will have shortcuts to introduce this delay, to avoid "pop-in" of lazy-loaded stuff.
Like, yeah, pop-in looks a little weird, but suddenly APIs are making that one Mass Effect elevator into a first-class feature...
> Regardless of OS, they all seem extremely fast, and feel faster and faster as time goes on.
One analogy is that the distance between two places in the world hasn't changed, but we're also not arriving significantly faster than we did once modern jetliners were introduced. There was a period of new technology followed by rapid incremental progress toward shorter travel times, until it leveled off.
However, the number of people able to consistently travel between more places in the world has continued to increase. New airports open regularly, and airliners have been optimized to fit more people, at the cost of passenger comfort.
Similarly, computers, operating systems, and their software aren't aligned in optimizing for user experience. Until a certain point, user interactions on macOS took highest priority, which is why a single- or dual-core Mac felt more responsive than today's, despite the capabilities and total work capacity of new Macs being orders of magnitude higher.
So we're not really even asking for the equivalent of faster jet planes, here, just wistfully remembering when we didn't need to arrive hours early to wait in lines and have to undress to get through security. Eventually all of us who remember the old era will be gone, and the next people will yearn for something that has changed from the experiences they shared.
But apps shouldn't be able to hammer WindowServer in the first place. If your app is misbehaving, your app should hang, not the OS window compositor!
FWIU there's really no backpressure mechanism for apps delegating compositing (via CoreAnimation / CALayers) to WindowServer which is the real problem IMO.
QubesOS seems like a great migration target: it runs apps/OSes in secure sandboxes, and even with that overhead it doesn't seem worse than the terrible macOS 26 performance.
> I'm cautiously optimistic that AI will let us build full operating systems using other OSs as working examples.
Why? No one has shown that LLMs produce particularly good code. You can get a lot of useful shit done with what is still slop, but there is no reason to believe there's any evolutionary improvement.
I don't know much about CPU internals, but this sounds like bullshit to me. A NOP is still an instruction that uses a cycle - why should that cool the CPU down? The CPU frequency should get reduced to lower the power consumption and hence the temperature.
Not all cycles cost the same amount of power. (Not that you would want to spam NOPs for thermal management; you should idle the core with a pause or wait instruction that actually tells the processor what you're trying to do.)
It used to be the case with intel macs and their atrocious confluence of cooling system, thermals, and power supply system (the CPU actually was not really to blame).
But when RAPL and similar tools are used to throttle the CPU, the CPU time gets reported as kernel_task; on Linux it would show up similarly as one of the kernel threads.
It only took a quarter century, but I'm glad that somebody is finally adding a little multicore competition since Moore's law began failing in the mid-2000s.
I looked around a bit, and the going rate appears to be about $10,000 per 64 cores, or around $150 per core. Here is an Intel Xeon Platinum 8592+ 64 Core Processor with 61 billion transistors:
So that's roughly 6 million transistors per dollar, or 1 billion transistors for about $165.
It looks like Arm's 136 core Neoverse V3 has between 150 and 200 billion transistors, so at that rate a chip would cost around $25,000-33,000. Each blade has 2 of those chips, so figure $50,000-65,000 for compute. It doesn't say how much memory the blades come with, but that's a secondary concern.
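A quick sanity check of the per-transistor arithmetic (the $10,000 figure is the rough going rate above, not a quote):

```python
cores = 64
price_usd = 10_000       # rough going rate for a 64-core part
transistors = 61e9       # Intel Xeon Platinum 8592+

per_core = price_usd / cores
per_dollar = transistors / price_usd

print(f"${per_core:.0f} per core")
print(f"{per_dollar / 1e6:.1f} million transistors per dollar")
print(f"${1e9 / per_dollar:.0f} per billion transistors")
```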
Note that this is way too many cores for one bus: by Amdahl's law, beyond about 4-8 cores per bus the remaining cores are typically wasted. Real-world performance will be bandwidth-limited, so I would expect a blade to perform about the same as a 16-64 core computer. But that depends on mesh topology, so maybe I'm wrong (AI thinks I might be):
Intel Xeon Scalable: Switched from a Ring to a Mesh Architecture starting with Skylake-SP to handle higher core counts.
Arm Neoverse V3 / AGI: Uses the Arm CMN-700 (Coherent Mesh Network), which is a high-bandwidth 2D mesh designed specifically to link over 100 cores and multiple memory controllers.
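The diminishing-returns intuition can be made concrete with Amdahl's law; the 10% serial fraction below is an illustrative assumption standing in for bus/memory contention, not a measured number:

```python
def amdahl_speedup(cores: int, serial_fraction: float) -> float:
    """Amdahl's law: overall speedup given the fraction of work that stays serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

# If ~10% of the work serializes on a shared bus, speedup flattens out fast,
# capped at 10x no matter how many cores are added:
for n in (1, 4, 8, 16, 64, 136):
    print(f"{n:4d} cores -> {amdahl_speedup(n, 0.10):5.2f}x")
```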
I find all of this to be somewhat exhausting. We're long overdue for modular transputers. I'm envisioning small boards with 4-16 cores between 1-4 GHz and 1-16 GB of memory approaching $100 or less with economies of scale. They would be stackable horizontally and vertically, to easily create clusters with as many cores as one desires. The cluster could appear to the user as an array of separate computers, a single multicore computer running in a unified address space, or various custom configurations. Then libraries could provide APIs to run existing 3D, AI, tensor and similar SIMD code, since it's trivial to run SIMD on MIMD but very challenging to run MIMD on SIMD. This is similar to how we often see Lisp runtimes written in C/C++, but never C/C++ runtimes written in Lisp.
It would have been unthinkable to design such a thing even a year ago, but with the arrival of AI, that seems straightforward, even pedestrian. If this design ever manifests, I do wonder how hard it would be to get into a fab. It's a chicken and egg problem, because people can't imagine a world that isn't compute-bound, just like they couldn't imagine a world after the arrival of AI.
Edit: https://news.ycombinator.com/item?id=47506641 has Arm AGI specs. Looks like it has DDR5-8800 (12x DDR5 channels) so that's just under 12 cores per bus, which actually aligns well with Amdahl's law. Maybe Arm is building the transputer I always wanted. I just wish prices were an order of magnitude lower so that we could actually play around with this stuff.
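Back-of-the-envelope bandwidth per core for that configuration (assuming standard 64-bit DDR5 channels):

```python
mt_per_sec = 8800e6      # DDR5-8800 = 8800 mega-transfers/second
channel_bytes = 8        # assuming standard 64-bit (8-byte) channels
channels = 12
cores = 136

per_channel = mt_per_sec * channel_bytes     # bytes/second per channel
total = per_channel * channels
print(f"{per_channel / 1e9:.1f} GB/s per channel")
print(f"{total / 1e9:.0f} GB/s aggregate")
print(f"{total / cores / 1e9:.2f} GB/s per core")
```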
The post uses a 4-instruction program as an example, with about 256^4 or 4.3 billion combinations. Most interesting programs are 10, 100, 1000+ instructions long, which is too large a search space to explore by brute force.
So GAs use a number of tricks to explore the search space via hill climbing without getting stuck at local optima. They do that by treating each candidate program as a bit string, then randomly flipping bits (mutation) or splicing segments between two parent strings (crossover, the analog of sexual reproduction) to hop to related hills in the search space. Then the bit string is converted back to instructions and tested to see if it performs the desired algorithm.
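A minimal sketch of that mutate/crossover/select loop; here fitness just counts bits matching a toy target, standing in for "decode to instructions and run the tests":

```python
import random

TARGET = [1, 0, 1, 1, 0, 0, 1, 0]  # stand-in for "decodes to a correct program"

def fitness(bits):
    # A real superoptimizer would decode the bits to instructions,
    # run them, and count passing test cases instead.
    return sum(b == t for b, t in zip(bits, TARGET))

def mutate(bits, rate=0.05):
    # Flip each bit with a small probability.
    return [b ^ 1 if random.random() < rate else b for b in bits]

def crossover(a, b):
    # Single-point crossover: splice two parents together.
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

random.seed(0)
pop = [[random.randint(0, 1) for _ in TARGET] for _ in range(20)]
for gen in range(50):
    pop.sort(key=fitness, reverse=True)
    if fitness(pop[0]) == len(TARGET):
        break                          # found a program that passes everything
    parents = pop[:10]                 # elitism: keep the fittest half
    pop = parents + [mutate(crossover(*random.sample(parents, 2)))
                     for _ in range(10)]

best = max(pop, key=fitness)
print(gen, best, fitness(best))
```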
The bit string usually encodes the tree form of a Lisp program to minimize syntax. We can think of it as if every token is encoded in bits (variable-length codes, like Huffman encoding or Morse code). For example, the tokens in a (+ 1 2) expression might have the encodings 00, 01 and 10, so the bit string would be 000110, and we can quickly explore all 3^3 = 27 token-level combinations (versus 2^6 = 64 if we naively manipulate an uncompressed bit string whose encoded token sizes vary).
Note that many of the bit strings like (+ + 1) or (2 1 +) don't run. So guard rails can be added to reduce the search space, for example by breaking out early when bit strings throw a compiler exception, or using SAT solvers or caching to weed out nonviable bit strings.
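A sketch of the decode step plus those guard rails; the 2-bit codebook, the viability check, and the memo cache are illustrative assumptions, not a standard encoding:

```python
# Hypothetical 2-bit codebook for a tiny Lisp vocabulary.
DECODE = {"00": "+", "01": "1", "10": "2"}

def bits_to_expr(bits):
    tokens = [DECODE.get(bits[i:i + 2]) for i in range(0, len(bits), 2)]
    if None in tokens:
        return None                    # unmapped code: not decodable
    return "(" + " ".join(tokens) + ")"

def viable(bits, seen={}):
    """Guard rail: cache verdicts and reject candidates that don't run."""
    if bits not in seen:
        expr = bits_to_expr(bits)
        try:
            op, *args = expr[1:-1].split()
            [int(a) for a in args]     # raises if an operand isn't a number
            seen[bits] = (op == "+")   # the only operator in this toy vocabulary
        except (TypeError, ValueError):
            seen[bits] = False         # the "compiler exception" case: weed it out
    return seen[bits]

print(bits_to_expr("000110"), viable("000110"))   # (+ 1 2) is viable
print(bits_to_expr("000001"), viable("000001"))   # (+ + 1) is not
```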
We could build a superoptimizer with GAs, then transpile between MOS 6502 assembly and Lisp (or even run the MOS 6502 assembly directly in a sandbox) and not have to know anything about how the processor works. To me, this is the real beauty of GAs, because they allow us to solve problems without training, at the cost of efficiency.
I don't think that LLMs transpile to Lisp when they're designing algorithms. So it's interesting that they can achieve high complexity and high efficiency via training, without even having verification built-in. Although LLMs with trillions of parameters, running on teraflop-scale GPUs with GBs of memory, may or may not be viewed as "efficient".
I suspect that someday GAs may be incorporated into backpropagation to drastically reduce learning time by finding close approximations to the matrix math of gradient descent. GAs were just starting to be used to pseudorandomly produce the initial weights of neural nets around 2000 when I first learned about them.
Also quantum computing (QC) could perform certain matrix math in a fraction of the time, or even preemptively filter out bit strings which aren't runnable. I suspect that AI will get an efficiency boost around 2030 when QC goes mainstream. Which will probably lead us to a final candidate learning algorithm that explains how quantum uncertainty and emergent behavior allow a physical mind to tune into consciousness and feel self-aware, but I digress.
Because modern compilers don't do any of this, and we aren't accustomed to multicore computing, from a sheer number-of-transistors perspective we're only getting a tiny fraction of the computing power that we might otherwise have if we designed chips from scratch using modern techniques. This is why I often say that computers today run thousands of times slower than they should for their transistor budgets.