This is fascinating, and aligned with my experience working on harnesses. I bet there is still significant upside left on the table with that same model.
There was a narrative last year by Anthropic that each new model release had them making the harness closer to a simple while loop with tools, but now it seems to be going in the other direction. There's just so much to explore with harnesses. Rolling context windows (instead of compaction) have been very powerful in my work with agentic harnesses, combined with a persistent high-level summary and a detailed automated feedback pipeline (granted, this is easier said than done if you don't have specific, consistent goals for your agent like we do).
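For anyone curious, here's a minimal sketch of the rolling-window idea, assuming a generic `summarize(old_summary, dropped_step)` helper; all the names are hypothetical, not any particular framework:

```python
from collections import deque

# Sketch: keep the last N steps verbatim plus a persistent high-level
# summary, instead of compacting the whole history when it overflows.
MAX_RECENT = 20

class RollingContext:
    def __init__(self):
        self.summary = ""                       # persistent high-level summary
        self.recent = deque(maxlen=MAX_RECENT)  # raw recent steps

    def add_step(self, step: str, summarize) -> None:
        # Fold the step about to fall off the window into the summary,
        # rather than discarding it outright.
        if len(self.recent) == self.recent.maxlen:
            self.summary = summarize(self.summary, self.recent[0])
        self.recent.append(step)

    def build_prompt(self, task: str) -> str:
        return "\n\n".join([
            f"Task: {task}",
            f"Summary so far: {self.summary or '(none)'}",
            "Recent steps:\n" + "\n".join(self.recent),
        ])
```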
A better benchmark needs to be objectively scored, have multi-disciplinary breadth, and be scalable (no single correct answer).
That's what we designed at https://gertlabs.com. We put a lot of thought into it, and kept it mostly (not fully) related to problem solving through coding.
Wow. This benchmark definitely feels more accurate than the other rankings I've seen. My experience with GPT 5.4/5.5 is that they are technically flawless, and if there are any technical issues, it's because the input didn't provide enough clarity. That's not to say it won't autonomously react to issues during bug fixes or implementations, but it tends to nail its tasks without leaving gaps behind.
Opus otoh is overrated in terms of its technical ability. It is certainly a better designer/developer for beautiful user experiences, but I'll always lean on gpt 5.5 to check its work.
The biggest surprise in the benchmark is Xiao-Mi. I haven't tried it yet, but I will be after looking at this.
Congrats to your team for putting together something meaningful to make sense of the ongoing AI speedrun! Great work!
Are we looking at the same data? On that site I see that Opus 4.7's and GPT 5.5's g scores are within each other's confidence intervals, and both significantly ahead of the number 3 model.
Your comment makes it sound like they are miles apart, which the benchmark doesn't seem to support.
Edit:
I looked at the data more, and the two models are basically equal only when looking at the mean of all the tests. GPT 5.5 significantly outperforms Opus 4.7 in coding, while Opus 4.7 significantly outperforms in "decision making." I'm not seeing details on what decision making explicitly means.
Decision making refers to the environments where the LLM is called on every tick (like games with social communication), examples here: https://gertlabs.com/spectate.
Because GPT 5.5 just launched and those games take longer to accumulate data for, it just doesn't have enough samples yet. It will end up with a wider lead on Opus, I am sure. Coding evals always have large sample sizes on day 1. Good find; we should probably adjust the weighting for decision games with low match counts.
Right, I'm including my own observations in what the leaderboard is showing. Could be confirmation bias, but I use both Opus and GPT extensively and since GPT 5.4 I have noticed that Opus doesn't even begin to touch GPT's level of technical depth. I was hoping Opus 4.7 would close that gap, but unfortunately it doesn't even compare to GPT 5.4 in that sense.
I'm not being a hater, I love Opus for different reasons, but I can't rely on it for its technical ability.
Amazing to see the top Claude models still way above all other models for C++ & Java, while GPT 5.5 is higher in Python & JS and others. Shows the skew in the training data sets, and maybe the go-to-market focus, with Anthropic focusing on enterprise customers much more than OpenAI?
Matches with my experience with Opus for C++.
C# results are empty - @gertlabs - any ETA for those?
C# testing is a new feature added a few days ago from HN comment suggestions, samples will continue growing. Most C# data is currently for non-agentic workloads: https://gertlabs.com/?mode=oneshot_coding
It's a surprising result, and a lot of it stems from the Pro variant struggling with our custom harness in agentic tasks (whereas Flash does fine), as well as provider instability. Failed requests are not counted against the model in its score, but it's possible there are additional silent degradations even on successful requests.
Either that, or Flash is truly a better architecture and the Pro variant is heavily benchmaxxed. It wouldn't be the first time we saw something like that in our benchmarking. We collect samples every week so it'll be interesting to see if it rebalances over time as new providers host the model. Flash is great though; it's so fast and cheap.
Our philosophy is that you can design problems so that they can scale through a few release cycles by making environments more complex, with no known ceiling. The key for scalability is not having a single correct answer (even though Victor's benchmark is interesting), but still being objectively scorable.
That's what we've done with our comprehensive reasoning and coding benchmark at https://gertlabs.com
Self organizing systems is an area of research to which I think LLMs will contribute immensely.
But as of now, even newer AI models are not particularly insightful. I'm always surprised by how suboptimal near-frontier LLMs are at collaborating in some of the easier cooperative environments on my benchmarking and RL platform. For example, check out a replay of consensus grid here: https://gertlabs.com/spectate
While interesting, it's not clear to me just from looking at consensus grid how they are prompted.
Do you tell them to think and coordinate the next step through some type of sync/talking mechanism, or is it turn by turn?
I suspect turn by turn, since it's similar to other experiments, and in that case it wouldn't work because they wouldn't have a set amount of time to think about the next step together?
All of our environments are tick based (with ticks of varying speeds), and this is explained in the prompt given to the models, along with the latest observation and a history of recent events/conversations/actions.
So that does make the game more challenging, versus some other simulations we have where multiple conversation turns happen before action. But the inefficiencies I'm describing are different; for example, an agent reaches part of the destination area but is clearly blocking another player who needs to pass, and most models will just stay put instead of moving along to another target spot.
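To make that concrete, here's a rough sketch of what assembling a per-tick prompt can look like; the field names and layout are illustrative, not our exact format:

```python
# Illustrative only -- the actual prompt on gertlabs.com may differ.
def build_tick_prompt(rules: str, tick: int, observation: str,
                      history: list[str], max_history: int = 10) -> str:
    recent = history[-max_history:]  # recent events/conversations/actions
    return "\n".join([
        "Game rules (including the tick cadence and available actions):",
        rules,
        f"Tick {tick}. Latest observation:",
        observation,
        "Recent events:",
        *recent,
        "Respond with your next action (and, optionally, a chat message).",
    ])
```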
So is "Game Overview" the prompt? Because i can't seem to see any indication / hint given to the models that its a game they should work together on and commmunicate etc.
The agent makes a copy of itself in /tmp/. Runs. Evaluates. Updates itself. Makes a copy of itself. Runs. Evaluates. Updates itself. Makes a ...... you get the idea.
It will not stop if the recursion is given a hard-to-meet termination condition. Also, if it can cheat to satisfy the termination condition, it will.
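In rough Python, the loop looks something like this (the evaluate/update steps are placeholders for whatever the agent actually does):

```python
import shutil
import subprocess
import sys

TARGET_SCORE = 1.0  # a hard-to-meet termination condition

def evaluate(path: str) -> float:
    """Placeholder scorer; a real harness would run tests, check output, etc."""
    return 0.0

def update_agent(path: str) -> None:
    """Placeholder; here the agent rewrites its own source before the next run."""

src = "agent.py"
for generation in range(100):                 # cap runaway recursion
    clone = f"/tmp/agent_gen{generation}.py"
    shutil.copy(src, clone)                   # makes a copy of itself
    subprocess.run([sys.executable, clone])   # runs
    if evaluate(clone) >= TARGET_SCORE:       # evaluates
        break
    update_agent(clone)                       # updates itself, then repeats
    src = clone
```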
Early takeaway from this release: DeepSeek V4 Flash is the model to pay attention to here. It's cheap, effective, and REALLY fast.
The Pro model is slow, not much better at coding reasoning so far when it works, and honestly too unreliable and rate-limited to be of much use currently. Hopefully that improves as new providers host the model. Flash is working fine and is currently performing competitively with recent releases, but only on agentic workflows. Check back in 24 hours for full combined scoring with tool use and long context for both models.
Many of the frontier Chinese AI labs have released near-frontier models that are just a little bit behind Opus 4.6 in terms of speed, tool use ability, or long context handling. Open weights are winning the AI race, led by China. Crazy couple weeks of releases.
Mimo V2.5 Pro by Xiaomi (not open weights) is actually the best performer of the latest string of Chinese releases in our combined, comprehensive benchmarks, despite getting less attention. Kimi K2.6 is the most interesting open weights release, still. DeepSeek is not the leader in the space anymore.
An interesting pattern with the latest string of Chinese releases is the much better agentic boost (the models are not as smart out of the box, but their ability to iterate in a loop with tools makes up most of the difference). DeepSeek V4 Flash exemplifies this: not a smart model on the first try, but it makes up for it over the course of a session.
I would say all benchmarks are inherently subjective. How is yours better? It seems to produce somewhat strange results: Opus 4.6 being worse than 4.5, for example, or Chinese models being rated too high. Kimi, DeepSeek, and GLM are all great in the open source world, but I don't believe they are ahead of SOTA models from Anthropic, OpenAI, or Google.
No, some benchmarks are definitely objective, but most can be easily gamed. For example, most of the benchmarks on the model cards: they have measurable answers that don't rely on a human judge (a human made the question, but the answers measure some uncontroversial knowledge or capability). But because there is a single correct answer, and those answers leak (or are randomly discovered and optimized for in training), they lose value over time, and regardless, they have a ceiling on the intelligence they can measure.
Others are purely subjective, like LMArena, which really only measures the personality and style preferences of the masses at this point, because frontier LLM technical answers are too hard for the average person to judge.
Then there are some interesting one-off benchmarks, but they lack enough rigor, breadth, and samples to draw larger conclusions from.
So we designed our benchmark with 3 goals: objective measurements (individual submissions not dependent on a human or LLM judge), no known correct answer (so simulations can scale to much higher levels of intelligence), and enough variety over important aspects of intelligence. We do this by running multiple models in cooperative/competitive environments with very complex action spaces and objective scoring, where model performance is relative and affected by the actions of other participants.
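As one illustration of relative scoring, an Elo-style update turns pairwise match outcomes into a ladder. This shows the general idea of relative, objectively scored rankings, not necessarily our exact aggregation:

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 16.0):
    """score_a is 1.0 for a win, 0.5 for a draw, 0.0 for a loss."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    return (r_a + k * (score_a - expected_a),
            r_b + k * ((1.0 - score_a) - (1.0 - expected_a)))

# An upset win by the lower-rated model moves both ratings noticeably:
print(elo_update(1500.0, 1600.0, 1.0))  # -> (~1510.24, ~1589.76)
```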
And yeah, there are some interesting results when you have a more objective benchmark. It should raise eyebrows when every single sub-release of every company's model is better across the board than its predecessor -- that isn't reality.
I agree that benchmarks are inherently subjective.
But the fact that you cite your belief as your main argument is funny: you don't even have any inherently subjective numbers to justify what you believe, you only have "I don't believe".
Sure, I mixed two things together. I don't think this benchmark is bad; I just did not like that it is presented as the ultimate objective truth. The other thing I mentioned is that it delivers different results from other benchmarks, so the "belief" stems from those other benchmarks.
You are arguing from your belief instead of an objective truth. The benchmark is more objective; if you don't agree with it, come up with a better one. But what you believe doesn't matter.
It was not a confrontational take. But all benchmarks are designed by humans, and we are not that great at measuring intelligence, so it is somewhat subjective. I was just arguing with the word "objective", not with the results per se.
Only if the benchmark is private and done properly on relevant tasks, which is rarely the case. I can guarantee that you have a ton of blind spots if you look at it through the lens of a ranking ladder in some generic tasks.
I'm particularly interested in it being REALLY fast - do you have any rough tok/s numbers for the flash model? I'm excited for unsloth to drop some quants that I can try and run locally, but really curious how it's been performing speed wise. In general I actually over-index on speed over intelligence. I'd rather a model make mistakes quickly and correct in a follow-up than take forever to get a slightly better initial result.
Take a look at the Time column in https://gertlabs.com/?mode=oneshot_coding -- this is the total time to complete a solution for a reasonably complex problem end-to-end (you would have to divide avg submission size by it to estimate tok/s). It's fast in the sense that most of the smart, recent Chinese releases are quite slow, especially the DeepSeek Pro variant. Opus 4.7 is also quite fast.
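Back-of-envelope, with made-up numbers:

```python
# Rough tok/s estimate from the Time column; both values are hypothetical.
total_time_s = 120.0          # "Time": end-to-end solution time
avg_submission_tokens = 6000  # average submission size
print(avg_submission_tokens / total_time_s, "tok/s")  # -> 50.0 tok/s
```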
If pure speed is most important for your use case, GPT-5.3 Chat is the fastest model we've tested and it's still reasonably smart. Not meant for agentic tool usage / long context, though.
So it might be more useful for business applications or non-engineering usage where you don't need exceptional intelligence, but it's useful to get fast, cheap responses.
I see the value in that, but there are a few reasons that isn't on the immediate roadmap -- mainly, it shifts focus from measuring the model to measuring the harness. The agentic benchmark section you see on the site is comparable to how an agent would perform using an open harness like Pi. But latest tool-using models are pretty well adapted to any harness, so I think that's less of a factor in overall model performance.
We've been working on a way to address the obvious problems with existing benchmarks, by creating a single comprehensive benchmark that measures things that technical people care about, while also getting as close to an objective, "core intelligence" measurement as possible.
Some demo games shown on /spectate give you an idea of how we test models and why this would be difficult to benchmax. I think our benchmark is by far the best relative measurement of artificial intelligence out there. Feedback is welcome and usually acted upon quickly.
Early benchmarks show tremendous improvement over Kimi K2 Thinking, which didn't perform well on our benchmarks (and we do use best available quantization).
Kimi K2.6 is currently the top open weights model in one-shot coding reasoning, a little better than GLM 5.1, and still a strong contender against SOTA models from ~3 months ago (comparable to Gemini 3.1 Pro Preview).
Agentic tests are still running; check back tomorrow. Open weights models typically struggle with longer contexts in agentic workflows, but GLM 5.1 still handled them very well, so I'm curious how Kimi ends up. Both the old Kimi and the new model are on the slower side, which probably makes them less usable for agentic coding work regardless. The old Kimi K2 model was severely benchmaxxed and was only really interesting for generating more variation and temperature, not for solving hard problems. The new one is a much stronger generalist.
Overall, the field of open weights models is looking fantastic. A new near-frontier release every week, it seems.
Cool website. I don't understand enough about the various benchmarks or how they're done to judge whether anything is accurate, but I love the layout and features, especially the spectator feature, which is pretty cool. One thing: I saw the "Market simulator" spectator feature but didn't see a corresponding benchmark for it. Is it "Finance", "Betting", or "Trading"?
In terms of raw token cost, I've seen a couple providers at (all prices per Mtok) $0.95 input / $0.15 cache input / $5 output, vs $3 input / $15 output for Sonnet.
Task prices, of course, will be more interesting: a dumber model may use more tokens to get to the same goal.
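A quick worked example with the prices above; the token counts are hypothetical:

```python
# Per-task cost at per-Mtok prices. Even if the cheaper model burns 2x
# the tokens, it can still come out ahead here -- but the gap narrows.
def task_cost(in_tok: int, out_tok: int, in_price: float, out_price: float) -> float:
    return (in_tok * in_price + out_tok * out_price) / 1e6

cheap = task_cost(400_000, 100_000, 0.95, 5.0)  # ~$0.88 at 2x the tokens
sonnet = task_cost(200_000, 50_000, 3.0, 15.0)  # ~$1.35
print(f"cheap: ${cheap:.2f} vs sonnet: ${sonnet:.2f}")
```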
I'm looking at your table now - is there a reason why you don't include cost? If Opus 4.7 is the winner but costs e.g. 5x as much, that's important information.
We recently added cost (last week), so data is sparse. Check back in a few weeks and it will be represented somewhere on the homepage, probably in the Efficiency Chart at the bottom. We also plan to show model performance deviation over time after we collect more data.
I'm interested to hear about any other data representations you'd like to see, too. The goal is to convey the most important information as densely as possible, without too much clutter.
Not sure what you mean. Time series chart of model performance over time to see if proprietary models get degraded? That's in the works, but we will need a couple months more data collection before launch.
The idea is that the larger a coding task is and the longer the coding agent runs, the higher the chance that the agent won't follow the rules and guidelines.
We will as soon as API access is widely available. Once a model goes live, we typically have one-shot reasoning benchmarks up in ~8 hours and comprehensive agentic/combined benchmarks up after 24-48 hours. We're working on building relationships with each lab to have the results before launch.
It's interesting; I can only speculate as to the underlying reason. When given enough time, models outperform in Rust/C++ in longer agentic tasks, and actually perform worst in Python, at least for tasks that aren't judged on code speed. https://gertlabs.com/?mode=agentic_coding
It makes sense when you consider that LLMs don't generalize very well, so they're heavily dependent on how good (how varied as well as how high-quality) the training data is.
Well it might explain why pro-Claude vs pro-Codex people keep talking past each other on this forum. I see people all the time assuming that anybody who likes Codex must be some sort of bot because of their own biases, but I work almost exclusively in Rust and find Codex extremely competent (and a much better overall engineer), don't trust Claude/Opus at all... but I see in this bench it scores lower on TypeScript etc. than Opus does.
Good question. We missed that release entirely. Our automated model checker only went live 2 months ago so they were manually curated prior to that. I'm adding it now. It'll be live in ~12 hours.
Can you add C# to the supported languages? It's widely used, and it'd be helpful for people and companies to see how different models fare against each other.