While unified memory may offer better performance than unsoldered DDR system memory, it still won't match the 1.8TB/s bandwidth of high-end consumer GPUs right now.
Nvidia's master plan may be to make "only" 400GB/s bandwidth the new normal, thus gatekeeping local model usage further behind "more memory, but not as fast as the cloud can do it".
I think it’s an interesting theory but a bit too conspiracy theory-ish.
Nvidia just wants to sell stuff to everyone.
And I think for professionals doing local AI work, products like Strix Halo and Apple Silicon are a competitive threat.
A big part of maintaining the leading software ecosystem is ensuring you have competitive hardware for all your users.
I also think the RTX Spark product is relatively low effort for Nvidia. Grab a Mediatek CPU and slap an Nvidia GPU on the die. Sure, that’s oversimplifying it, but still.
For slightly complex changes, I always run codex (5.5-high) in planning mode first.
I have linked various docs/{ARCHITECTURE,BACKEND-GUIDELINES,NESTJS-DI,..}.md etc. from AGENTS.md so the agents can quickly discover relevant docs at planning time, only if they are needed. No need to know React-specific stuff when it's dealing with a backend problem, for example. I typically blindly approve plans made by the agent with a fresh context, because that's as if I had prompted it. Works the best for me.
Using /goal, however, it's really just constantly compacting and doing its thing, so of course it gets sloppy. If only there were a state machine that would transform tickets into a Planning Mode prompt, then use, idk, guardian approvals (somehow a "Product Management Perspective Lens" approving or making changes to the plan), and then let a less capable or lower-reasoning agent execute the plan. I think that would work the best.
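To make the state machine idea a bit more concrete, here is a minimal sketch of what I have in mind. Everything in it is hypothetical: `State`, `Job`, `run`, and the three callables (`planner`, `guardian`, `executor`) are placeholder names, and the stubs at the bottom stand in for real model calls. It only shows the control flow, not how any actual tool does it.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, List, Tuple


class State(Enum):
    TICKET = auto()      # raw ticket, nothing done yet
    PLANNING = auto()    # strong model writes a plan from a fresh context
    REVIEW = auto()      # "guardian" approves or amends the plan
    EXECUTING = auto()   # cheaper / lower-reasoning agent carries it out
    DONE = auto()


@dataclass
class Job:
    ticket: str
    state: State = State.TICKET
    prompt: str = ""
    plan: str = ""
    result: str = ""
    trace: List[State] = field(default_factory=list)


def run(
    job: Job,
    planner: Callable[[str], str],
    guardian: Callable[[str], Tuple[bool, str]],
    executor: Callable[[str], str],
    max_reviews: int = 3,
) -> Job:
    """Drive one ticket through plan -> review -> execute (hypothetical flow)."""
    reviews = 0
    while job.state is not State.DONE:
        job.trace.append(job.state)
        if job.state is State.TICKET:
            # Turn the ticket into a planning-only prompt.
            job.prompt = f"Plan only, do not edit code yet.\n\nTicket: {job.ticket}"
            job.state = State.PLANNING
        elif job.state is State.PLANNING:
            job.plan = planner(job.prompt)
            job.state = State.REVIEW
        elif job.state is State.REVIEW:
            # Guardian returns (approved, possibly amended plan).
            approved, job.plan = guardian(job.plan)
            reviews += 1
            if approved:
                job.state = State.EXECUTING
            elif reviews >= max_reviews:
                raise RuntimeError("plan not approved after max_reviews rounds")
        elif job.state is State.EXECUTING:
            job.result = executor(job.plan)
            job.state = State.DONE
    return job


# Stub callables standing in for real model calls (assumptions, not real APIs).
def fake_planner(prompt: str) -> str:
    return "1. add endpoint"


def fake_guardian(plan: str) -> Tuple[bool, str]:
    if "tests" in plan:
        return True, plan
    return False, plan + "\n2. add tests"


def fake_executor(plan: str) -> str:
    return f"executed {len(plan.splitlines())} steps"


job = run(Job("Add /health endpoint"), fake_planner, fake_guardian, fake_executor)
print(job.result)
print([s.name for s in job.trace])
```

The point of keeping the review step as its own state is that the plan can bounce there a few times without the executor ever seeing a half-approved plan.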
If you hadn’t written that post using AI, it might’ve received more attention. Also, (1) if you’d put LinkedIn in the title, rather than the very bottom of the post, and (2) if you’d provided any insight, rather than just speculation, as to what the data might be being used for.
I have written something about LinkedIn, although not about browser fingerprinting; it was about an extremely bad experience with LinkedIn.
Not sure if this counts, but timing-wise my post was actually sandwiched between two large LinkedIn posts (the "2 tabs = 8 GB" one and now this) [0]
I always write things myself, even if they might take hours.
But I also believe that my post overlapped with bigger AI news (OpenAI getting funded, Claude being leaked). I have seen some cool projects lately on Hacker News which aren't getting attention, as all of that attention gets redirected to AI-related news.
[0]: To be honest, I write things for myself first and just upload them here for discussion. I am perfectly fine with my posts not gaining traction, because I try to (and wish to) write for myself first and foremost :). Also, with that LinkedIn incident, I really just wrote things to get it off my chest.
Thank you for that post, it describes the invasion of privacy at a deeper level. I must have missed it but YCombinator is filled with people with a vested interest in keeping the clown show going.
How do you arrive at that number? I find it hard to make sense of this ad hoc, given that the total token cost is not very interesting; it's token efficiency we care about.
> prompts with >272K input tokens are priced at 2x input and 1.5x output for the full session for standard, batch, and flex.
which is basically maxed out quickly. So there is 2x (the first lever).
Then there is the /fast mode, which they state costs 2x more (for 1.5x speedup)
And then there is the model base price ($2.50 vs $1.75), which is, well, about a 43% increase. That makes it in fact a 5.7x total increase in token cost in fast mode with large context. (Sorry for the confusion, I thought it was 8x because I thought gpt-5.3-codex was $1.25.)
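Spelled out as a quick sketch, using only the numbers quoted above and looking at the input-token price only (the function name is made up for illustration):

```python
def total_multiplier(new_price: float, old_price: float,
                     long_context: float = 2.0, fast_mode: float = 2.0) -> float:
    """Combined input-token cost multiplier from the three levers."""
    base_price_ratio = new_price / old_price
    return long_context * fast_mode * base_price_ratio


# Actual base prices: $2.50 vs $1.75 per 1M input tokens.
actual = total_multiplier(2.50, 1.75)

# What I mistakenly assumed: the old model at $1.25.
mistaken = total_multiplier(2.50, 1.25)

print(f"{actual:.2f}x")    # 5.71x
print(f"{mistaken:.2f}x")  # 8.00x
```

This is the worst case where both the long-context and /fast multipliers apply to every token, which is not how a typical session actually looks.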
(After a day of usage, I am relatively certain in practice this does not end up being a 5.7x cost increase or anything close to that, though I am still fairly unclear on what that computation is worth to begin with, given that I am entirely fine with the model using the least amount of tokens possible to get the job done)
1. It's 1.5x, and it's quite fast for the level of thinking it has.
2. No. If you are on a subscription, it's the same. At $20, codex 5.4 xhigh provides way more than $20 Opus thinking (that one really can burn 33% with 1 request; try comparing them on the same tasks). Also, 8x?? If you need 1M tokens for a special task, you don't hit /fast, and vice versa. The higher price doesn't apply on a subscription either.
3. False. I'm on Pro, so 10x the base, always on /fast (no 1M), and often with 2 parallel instances working. I can hardly use 2% (= 20% of the 5h limit) in 1h of work (about 15-20 req/hour). Claude is way worse on that, imo.
But at some point the model is large enough to accomplish any task a human could specify. For software development, I think we're pretty much at that point with the latest Anthropic/Google/OpenAI models. We have no idea where token pricing is going to go in the future, but the consensus seems to be that it will only get more expensive. If Taalas can offer the same functionality that we have with frontier models today at 1/10 of the cost and 10x the speed, then they're going to take over a large part of the market.
At this point, the pelican benchmark has become so widely used that there must be high-quality pelicans in the dataset, I presume. What about generating an okapi on a bicycle instead?