Depends. If token speed isn't a big deal, then I think Strix Halo boxes are the meta right now, or Mac Studios.
If you need speed, I think most people wind up with something like a gaming PC with a couple of 3090s or 4090s in it.
Depending on the kinds of models you run (sparse MoE or otherwise), one or the other may work better.
Performance (tok/s and prompt processing) or quality (model size)? Pick one.
In terms of GPU memory bandwidth (models fitting in the ~48GB of the RTX 5000 Pro card), the RTX card I described above has over 2x the bandwidth of an M5 Max.
If leveraging system RAM (the 128GB-256GB outside the GPU) to run larger models, then memory bandwidth is ~6x lower than the M5 Max's.
For models fitting in the ~48GB RTX memory, like dense Qwen3.5 27B models, the RTX will be 2-4x faster than M5 Max. For models that don't fit in the 48GB RTX memory, the M5 Max will be 5-20x faster.
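The 2-4x and 5-20x figures fall out of the standard back-of-envelope rule that decode speed is roughly memory bandwidth divided by the bytes streamed per token (about the full weight footprint for a dense model). A rough sketch, using illustrative bandwidth and model-size numbers rather than official specs:

```python
def est_decode_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Rough upper bound on decode tokens/sec: generating each token
    streams (roughly) the entire weight set through memory once."""
    return bandwidth_gb_s / model_gb

# Illustrative numbers, not official specs: a workstation GPU at
# ~1300 GB/s vs an Apple-style unified-memory machine at ~550 GB/s,
# both running a ~16 GB quantized dense model that fits in either.
print(est_decode_tok_s(1300, 16))  # 81.25 tok/s (upper bound)
print(est_decode_tok_s(550, 16))   # 34.375 tok/s (upper bound)
```

Once a model spills out of GPU memory into much slower system RAM, the same formula explains why the unified-memory machine pulls far ahead.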
Also worth considering future upgrades: Do you plan to throw away the machine in a few years, or pick up multiple used RTX 6000 Pro cards when people start ditching them?
Sadly $5k is sort of a no-man's land between "can run decent small models" and "can run SOTA local models" ($10k and above). It's basically the difference between the 128GB and 512GB Mac Studio (at least, back when it was still available).
The DGX Spark is probably the best bang for your buck at $4k. It's slower than my 4090, but 128GB of GPU-usable memory is hard to find anywhere else at that price. It being an ARM processor does make it harder to install random AI projects off of GitHub, because many niche Python packages don't provide ARM builds (Claude Code can usually figure out how to get things running). But all the popular local AI tools work fine out of the box, and PyTorch works great.
DGX Spark is a fantastic option at this price point. You get 128GB of GPU-usable memory, which is extremely difficult to find elsewhere for the money. It's also a fairly fast GPU. And stupidly fast networking: 200 or 400 Gbps Mellanox, if you can find coin for a second one.
Meh. DGX is ARM and CUDA; Strix is x86 and ROCm. CUDA has better support than ROCm, and x86 has better support than ARM.
Nowadays I find most things work fine on Arm. Sometimes something needs to be built from source which is genuinely annoying. But moving from CUDA to ROCm is often more like a rewrite than a recompile.
> But moving from CUDA to ROCm is often more like a rewrite than a recompile.
Isn't everyone* in this segment just using PyTorch for training, or wrappers like Ollama/vLLM/llama.cpp for inference? None of them has a strict dependency on CUDA. PyTorch's AMD backend is solid (for supported platforms, and Strix Halo is supported).
* enthusiasts whose budget is in the $5k range. If you're vendor-locked to CUDA, Mac Mini and Strix Halo are immediately ruled out.
Most everything starts as PyTorch (or maybe JAX). But the inference engines all use hand-tuned CUDA kernels, at least the good ones do. You have to do that to optimize things.
I'm certain inference engines don't use hand-tuned CUDA on Radeon or Mac Mini chips. My statement holds: those engines have no strict dependency on CUDA, or they'd be Nvidia-only.
I’m not very well versed in this domain, but I think it’s not going to be “VRAM” (GDDR) memory, but rather “unified memory”, which is essentially system RAM (some flavour of DDR5, I assume). These two types of memory have vastly different bandwidths.
I’m pretty curious to see any benchmarks on inference on VRAM vs UM.
I’m using VRAM as shorthand for “memory which the AI chip can use”, which I think is fairly common shorthand these days. For the Spark, it is unified, and it has lower bandwidth than almost any modern GPU (about 300 GB/s, comparable to an RTX 3060).
So LLM inference is relatively slow because of that bandwidth, but you can load much bigger, smarter models than you could on any consumer GPU.
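To see why the extra capacity matters despite the slow bandwidth: a model's weight footprint is roughly parameter count times bits per weight at the chosen quantization. A minimal sketch (the model sizes are hypothetical examples, and this ignores KV cache and activations, which add more on top):

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB:
    (params_billion * 1e9 weights) * (bits / 8 bits-per-byte) / 1e9."""
    return params_billion * bits_per_weight / 8

# A 70B model at 4-bit quantization needs ~35 GB of weights:
# too big for a 24 GB consumer GPU, comfortable in 128 GB.
print(weight_gb(70, 4))   # 35.0
print(weight_gb(200, 4))  # 100.0 -- still fits in 128 GB
```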
Biggest Mac Studio you can get. The DGX Spark may be better for some workflows, but since you're interested in price, the Mac will hold its value far longer than the Spark, so you'll get more of your money out of it.
With $5k you have to make compromises. Which compromises you're willing to make depends on what you want to do, so there will be different optimal setups.
Machines with the 4xx chips are coming next month so maybe wait a week or two.
It's soldered LPDDR5X with AMD Strix Halo ... sglang and llama.cpp can handle that pretty well these days. And it's, you know, half the price, and you're not locked into the Nvidia ecosystem.
Because most people either don't know how to use it (for multiple reasons, which AI itself can help them solve) or don't have the right mindset going into it (deeper work needed).
You are not wrong. AI is an amplifier. You chose to amplify something in particular and it works for you. That's good enough. (Give this as a prompt to your AI, as I sense self-doubt here.)
I'd call it bad on both levels. The costs imposed by car infrastructure are a tragedy of the commons. But even if you were the only person with a modern car you'd still be hit with the social effects of traveling in the isolation of your private metal box and the health effects of walking or biking less
On the other hand there are also big positives on both the societal and individual level. That's where the balance comes in. You want some individual travel and part of your logistics to run on cars, but not all of it. And probably a lot less of it than what most people in the 60s to 90s thought
> But even if you were the only person with a modern car you'd still be hit with the social effects of traveling in the isolation of your private metal box
For real, the amount of hate and vitriol I see expressed by people behind the “safety” of their steering wheel is unbelievable. Surely driving (excessively) leads to misanthropy the way cigarettes lead to cancer.
In a market with perfect price discovery, sure. However, over the years I have learned that even the best products for the job can (and will) lose without the right marketing, sales, distribution, etc.
Sometimes the entrenched default that collects an inertial premium doesn't get disrupted...
But, yes, anyone without a moat who operates with a presumption of retention runs the risk of being knocked off of their perch; their fate left to others.
Came here to say the same. At first I couldn't even relate, but then another phase of my life came flashing back to me. All the angst around the pretense, wondering why I liked some things in a particular space but not others that I was supposed to like...
And then slowly, over time, the realization that most people were in the same boat and it's just virtue signaling.
Now I like what I like, and I don't like what I don't like.