Hacker News | dkersten's comments

And every single time what they release is underwhelming.

Remember how Sam spent like a year talking about how scary close GPT-5 was to AGI and then when it did finally come out... it was kinda meh.


I’m building something similar. It’s not public yet because it’s still early and I’m still working on exactly what it is supposed to be.

But the idea is similar in that I start with a spec and feed the LLM a context that is a projection of the code and spec, rather than a conversation. The context is tailored to the current workflow stage (e.g. planning needs different context than implementing) and it doesn’t accumulate and grow (at least, the growth is limited and based on the tool call loop, not on the entire process).

My main goals are more focused context, no drift due to accumulated context, and code-driven workflows (the LLM doesn’t control the RPI (Research, Plan, Implement) workflow, my code does).

It’s built as a workflow engine so that it’s easy for me to experiment with and iterate on ideas.
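To make the idea concrete, here is a minimal sketch of what such a code-driven workflow engine could look like. All names and the projection logic are invented for illustration; the actual tool isn't public, so this only demonstrates the shape: the code owns the stage order, and each stage gets a fresh stage-specific projection rather than an accumulating conversation.

```python
# Hypothetical sketch: the engine (not the LLM) decides which stage runs
# next, and each stage receives a fresh projection of the spec and code
# instead of an ever-growing conversation history.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Artifact:
    """Output of one stage, passed as the only carry-over to the next."""
    stage: str
    content: str


def project_context(stage: str, spec: str, code: str) -> str:
    """Build a stage-specific context instead of accumulating history."""
    if stage == "plan":
        # Planning might only need the spec plus a code summary.
        return f"SPEC:\n{spec}\n\nCODE SUMMARY:\n{code[:200]}"
    if stage == "implement":
        # Implementing needs the full code.
        return f"SPEC:\n{spec}\n\nFULL CODE:\n{code}"
    return spec  # research stage: spec only


def run_workflow(spec: str, code: str,
                 call_llm: Callable[[str, str], str]) -> list[Artifact]:
    artifacts: list[Artifact] = []
    prev = ""
    for stage in ("research", "plan", "implement"):  # code owns the order
        context = project_context(stage, spec, code)
        output = call_llm(context, prev)  # LLM sees only the projection
        artifacts.append(Artifact(stage, output))
        prev = output  # only the previous artifact carries forward
    return artifacts
```

Because context is rebuilt per stage, there is no unbounded growth: the only state that crosses a stage boundary is the previous artifact.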

I like your idea of using TOML as the artifact that flows between workflow stages; I’ll see if that’s something that might be useful for me too!


Very much the same thinking. Ossature already structures work that way at the plan level during audit, so curious to see where you take it. Happy to share more about the TOML approach if useful. Feel free to reach out (me at my domain)

I find Kimi white good if you ask it for critical feedback.

It’s BRUTAL but offers solutions.


what is Kimi white?

I was typing quickly on my phone. I meant “quite”, “I find Kimi quite good”

Not soft, not mild, but BRUTAL! This broke my brain!

I’ve never had any problems with MiniMax. I wouldn’t call the speed fast exactly, but it’s faster than GLM and seems similar to Opus.

It’s been fast enough that I’ve been using it as my main model (M2.7 and before that, M2.5). Opus still does better at tasks, but MiniMax is so much cheaper. I’ve used their cheaper plan and I’ve never been rate limited.


I’ve also never hit the MiniMax limits and M2.7 is pretty good.

Not as good as Opus, but substantially cheaper!


They could brand it as “New Windows”


Out with the old, in with the GNU.


> Did anyone actually use a MiniMax model and it actually worked?

Works well for me in Kilo Code. It's not as good as Opus, but it's substantially cheaper and gets the job done. I don't have any problems with it.


> At least the LLM will only take 5 minutes to tell you they don't know what to do.

In my experience, the LLM will happily try the wrong thing over and over for hours. It will rarely say it doesn’t know.


Don’t ask it to make changes off the bat, then - ask it to make a plan. Then inspect the plan, change it if necessary, and go from there.


I do. I tend to follow a strict Research, Plan, Implement workflow. It does greatly help, but it doesn’t eliminate all problems.


Anthropic have done a lot of things that would give me pause about trusting them in a professional context. They are anything but transparent, for example about quota limits. Their vibe-coded Claude Code CLI releases are a buggy mess, too. There’s also the model quality inconsistency: before a new model release, there’s a week or two where their previous model is garbage.

A/B testing is fine in itself; you need to learn about improvements somehow. But this seems to be A/B testing of cost-saving optimisations rather than of a better user experience. Less transparency is rarely good.

This isn’t what I want from a professional tool. For business, we need consistency and reliability.


> vibe-coded Claude Code CLI releases are a buggy mess

this is what gets me.

are they out of money? are they so desperate to penny-pinch that they can't just do it properly?

what's going on in this industry?


I get the value of dogfooding, but I feel that in this case, a solid trustworthy foundation is much more important than dogfooding.


I’m a huge user of AI coding tools but I feel like there has been some kind of a zeitgeist shift in what is acceptable to release across the industry. Obviously it’s a time of incredibly rapid change and competition, but man there is some absolute garbage coming out of companies that I’d expect could do better without much effort. I find myself asking, like, did anyone even do 5 minutes of QA on this thing?? How has this major bug been around for so long?

“It’s kind of broken, maybe they will fix it at some point,” has become a common theme across products from all different players, from both a software defect and service reliability point of view.


I mean, really, they don't even need agentic AI or whatever; they could literally just employ devs and it wouldn't make a difference

like, they'll drop $100 billion on compute, but when it comes to devs who make their products, all of a sudden they must desperately cut costs and hire as little as possible

to me it makes no sense from a business perspective. Same with Google, e.g. YouTube is utterly broken, slow and laggy, but I guess because you're forced to use it, it doesn't matter. But still, if you have these huge money stockpiles, why not deploy them to improve things? It's only upside


I don’t think they’re even saving much by vibe coding it, given how many tokens they claim they’re using. I know the token cost to them is much, much lower than the token cost to us, but it still has a cost in terms of GPUs running.

Plus it’s not something we can replicate since we don’t have access to infinite tokens, so it’s not even a good dogfooding case study.


Code alone can never describe intent or rationale.


Indeed, you need both!

But documentation should not go too deep into the "how", otherwise it risks telling a lie after a while, as the code changes and the documentation lags.


