aid-ninja's comments

aid-ninja · 2026-04-09T03:11:33 1775704293

a rust version of that compiler (that the project runs on) ran at 480k claims/sec and it was able to deterministically resolve 83% of conflicts across 1 million concurrent agents (also 393,275x compression reduction @ 1m agents on input vs output, but different topics can make the compression vary)

natively claude (and other LLM) will resolve conflicting claims at about 51% rate (based on internal research)

the built in byzantine fault tolerance (again, in the compiler) is also pretty remarkable, it can correctly find the right answer even if 93% of the agents/data are malicious (with only 7% of agents/data telling us the correct information)

basically the idea here is if you want to build autonomous at scale, you need to be able to resolve disagreement at scale and this project does a pretty nice job at doing that

lmeyerov · 2026-04-09T04:02:35 1775707355

My question was on claims like "5x productivity boost in merged PRs (lots of open PR & merge rate goes down, but net positive)", eg, does this change anything on swe-bench or any other standard coding eval?

volatilityfund · 2026-04-09T06:38:24 1775716704

The ecosystem is 8 tools plus a claude code plugin, the unlock was composing those tools (I don't regularly use all 9). The 5x claim was from /insights (claude code)

Not for everyone, but it radically changed how I build. Senior engineer, 10+ years

Now it's trivial to run multiple projects in parallel across claude sessions (this was not really manageable before using wheat)

Genuinely don't remember the last time I opened a file locally

lmeyerov · 2026-04-09T20:33:46 1775766826

It sounds like the answer is "No, there is no repeatable eval of the core AI coding productivity claim, definitely not on one of the many AI coding benchmarks in the community used for understanding & comparison, and there will not be"

volatilityfund · 2026-04-10T03:58:20 1775793500

My data is from Anthropic

Not sure how it works under the hood, probably a better question for them

Perhaps you are misunderstanding the entire premise of this project, this is not an LLM

lmeyerov · 2026-04-11T08:45:53 1775897153

Maybe there's a fundamental miscommunication here of what evals are?

Evals apply not just to LLMs but to skills, prompts, tools, and most things changing the behavior of compound AI systems, and especially like the productivity claims being put forth in this thread.

The features in the post relate directly to heavily researched areas of agents that are regularly benchmarked and evaluated. They're not obscure, eg, another recent HN frontpage item benchmarked on research and planning.

aid-ninja · 2026-04-11T21:57:13 1775944633

your question makes sense, it's just not in current scope

we are still benchmarking the compiler at scale and the LLM tools that were made were created as functional prototypes to showcase a single example of the compiler's use case

since much of the unlock here is finding different applications for the compiler itself, we simply don't have the bandwidth to do much benchmarking on these projects on top of maintaining the repos themselves

all the code is open source and there is nothing stopping anyone from running their own benchmarks if they were curious

btw

https://news.ycombinator.com/item?id=47733217

aid-ninja · 2026-04-06T05:11:07 1775452267

I broadly agree with you but if you use claude code you should give this a try, the website doesn't really do it justice but wheat really solves for a lot of pain points when using claude for longer sessions

aid-ninja · 2026-04-06T05:01:47 1775451707

if you like wheat, check out farmer & orchard (both are really cool)

aid-ninja · 2026-04-06T04:51:39 1775451099

I literally just tell claude, here is the deekwiki please use this tool for the next task lol

HN For You