I’m building something similar. It’s not public yet because it’s still early and I’m still working on exactly what it is supposed to be.
But the idea is similar in that I start with a spec and feed the LLM context that is a projection of the code and spec, rather than a conversation. The context is specific to the workflow stage (e.g. planning needs different context than implementing) and it doesn’t accumulate and grow (at least, the growth is limited and based on the tool call loop, not on the entire process).
My main goals are more focused context, no drift due to accumulated context, and code-driven workflows (the LLM doesn’t control the RPI workflow, my code does).
It’s built as a workflow engine so that it’s easy for me to experiment with and iterate on ideas.
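To make that concrete, here’s a rough Python sketch of the shape, not the actual engine. Every name in it (`State`, `planning_context`, `implementing_context`, `run_workflow`, the `generated.py` path) is made up for illustration, and `llm` stands in for whatever model call you use. The point is just that each stage builds its prompt fresh from the spec and code, and plain code decides the stage order.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict

# Hypothetical names throughout: this is an illustration, not a real engine.
LLM = Callable[[str], str]  # takes a prompt, returns the model's text


@dataclass(frozen=True)
class State:
    spec: str
    code: Dict[str, str]  # path -> source
    plan: str = ""


def planning_context(state: State) -> str:
    # Planning only sees the spec and the file names, not the file contents.
    files = "\n".join(sorted(state.code))
    return f"SPEC:\n{state.spec}\n\nFILES:\n{files}"


def implementing_context(state: State) -> str:
    # Implementing sees the plan and the sources, not the planning prompt.
    sources = "\n\n".join(
        f"# {path}\n{src}" for path, src in sorted(state.code.items())
    )
    return f"PLAN:\n{state.plan}\n\nSOURCES:\n{sources}"


def run_workflow(state: State, llm: LLM) -> State:
    # The stage order lives here, in code. The model never picks the next step,
    # and no prompt carries over from one stage to the next.
    plan = llm(planning_context(state))
    state = replace(state, plan=plan)
    output = llm(implementing_context(state))
    # Assumption for the sketch: the model returns one file's worth of code.
    return replace(state, code={**state.code, "generated.py": output})
```

Each call to `llm` gets a prompt rebuilt from the current `State`, so nothing accumulates between stages except what the code explicitly stores (here, just the plan).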
I like your idea of using TOML as the artifact that flows between workflow stages. I will see if that’s something that might be useful for me too!
Very much the same thinking. Ossature already structures work that way at the plan level during audit, so curious to see where you take it. Happy to share more about the TOML approach if useful. Feel free to reach out (me at my domain)
I’ve never had any problems with MiniMax. I wouldn’t call the speed fast exactly, but it’s faster than GLM and seems similar to Opus.
It’s been fast enough that I’ve been using it as my main model (M2.7 and before that, M2.5). Opus still does better at tasks, but MiniMax is so much cheaper. I’ve used their cheaper plan and I’ve never been rate limited.
Anthropic have done a lot of things that would give me pause about trusting them in a professional context. They are anything but transparent, for example about the quota limits. Their vibe-coded Claude Code CLI releases are a buggy mess too. Then there’s the inconsistent model quality: before a new model release, there’s a week or two where their previous model is garbage.
A/B testing is fine in itself, you need to learn about improvements somehow, but this seems to be A/B testing of cost-saving optimisations rather than of ways to give the user a better experience. Less transparency is rarely good.
This isn’t what I want from a professional tool. For business, we need consistency and reliability.
I’m a huge user of AI coding tools but I feel like there has been some kind of a zeitgeist shift in what is acceptable to release across the industry. Obviously it’s a time of incredibly rapid change and competition, but man there is some absolute garbage coming out of companies that I’d expect could do better without much effort. I find myself asking, like, did anyone even do 5 minutes of QA on this thing?? How has this major bug been around for so long?
“It’s kind of broken, maybe they will fix it at some point,” has become a common theme across products from all different players, from both a software defect and service reliability point of view.
I mean it's like, really they don't even need agentic AI or whatever, they could literally just employ devs and it wouldn't make a difference
like, they'll drop $100 billion on compute, but when it comes to devs who make their products, all of a sudden they must desperately cut costs and hire as little as possible
to me it makes no sense from a business perspective. Same with Google, e.g. YouTube is utterly broken, slow and laggy, but I guess because you're forced to use it, it doesn't matter. But still, if you have these huge money stockpiles, why not deploy them to improve things? The cost wouldn't matter anyway, it's only upside
I don’t think they’re even saving much by vibe coding it, given how many tokens they claim they’re using. I know the token cost to them is much, much lower than the token cost to us, but it still has a cost in terms of GPUs running.
Plus it’s not something we can replicate since we don’t have access to infinite tokens, so it’s not even a good dogfooding case study.
Remember how Sam spent like a year talking about how scary close GPT-5 was to AGI and then when it did finally come out... it was kinda meh.