This is fascinating, and aligned with my experience working on harnesses. I bet there is still significant upside left on the table with that same model.
There was a narrative last year by Anthropic that each new model release had them making the harness closer to a simple while loop with tools, but now it seems to be going in the other direction. There's just so much to explore with harnesses. Rolling context windows (instead of compaction) have been very powerful in my work with agentic harnesses, combined with a persistent high-level summary and a detailed automated feedback pipeline (granted, this is easier said than done if you don't have specific, consistent goals for your agent like we do).
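For anyone curious, here's a minimal sketch of the rolling-window idea, assuming a generic `summarize(old_summary, dropped_step)` helper; all the names are hypothetical, not any particular framework:

```python
from collections import deque

# Sketch: keep the last N steps verbatim plus a persistent high-level
# summary, instead of compacting the whole history when it overflows.
MAX_RECENT = 20

class RollingContext:
    def __init__(self):
        self.summary = ""                       # persistent high-level summary
        self.recent = deque(maxlen=MAX_RECENT)  # raw recent steps

    def add_step(self, step: str, summarize) -> None:
        # Fold the step about to fall off the window into the summary,
        # rather than discarding it outright.
        if len(self.recent) == self.recent.maxlen:
            self.summary = summarize(self.summary, self.recent[0])
        self.recent.append(step)

    def build_prompt(self, task: str) -> str:
        return "\n\n".join([
            f"Task: {task}",
            f"Summary so far: {self.summary or '(none)'}",
            "Recent steps:\n" + "\n".join(self.recent),
        ])
```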
A better benchmark needs to be objectively scored, have multi-disciplinary breadth, and be scalable (no single correct answer).
That's what we designed at https://gertlabs.com. We put a lot of thought into it, and kept it mostly (not fully) related to problem solving through coding.
Wow. This benchmark definitely feels more accurate than the other rankings I've seen. My experience with GPT 5.4/5.5 is that they are technically flawless, and if there are any technical issues, it's because the input didn't provide enough clarity. That's not to say it won't autonomously react to issues during bug fixes or implementations, but it tends to nail its tasks without leaving gaps behind.
Opus otoh is overrated in terms of its technical ability. It is certainly a better designer/developer for beautiful user experiences, but I'll always lean on gpt 5.5 to check its work.
The biggest surprise in the benchmark is Xiao-Mi. I haven't tried it yet, but I will be after looking at this.
Congrats to your team for putting together something meaningful to make sense of the ongoing AI speedrun! Great work!
Are we looking at the same data? On that site I see that Opus 4.7's and GPT 5.5's g scores are within each other's confidence intervals, and both significantly ahead of the number 3 model.
Your comment makes it sound like they are miles apart, which the benchmark doesn't seem to support.
Edit:
I looked at the data more, and the two models are basically equal only when looking at the mean of all the tests. GPT 5.5 significantly outperforms Opus 4.7 in coding, while Opus 4.7 significantly outperforms in "decision making." I'm not seeing details on what decision making explicitly means.
Decision making refers to the environments where the LLM is called on every tick (like games with social communication), examples here: https://gertlabs.com/spectate.
Because GPT 5.5 just launched and those games take longer to accumulate data for, it just doesn't have enough samples yet. It will end up with a wider lead on Opus, I am sure. Coding evals always have large sample sizes on day 1. Good find; we should probably adjust the weighting for decision games with low match counts.
Right, I'm including my own observations in what the leaderboard is showing. Could be confirmation bias, but I use both Opus and GPT extensively and since GPT 5.4 I have noticed that Opus doesn't even begin to touch GPT's level of technical depth. I was hoping Opus 4.7 would close that gap, but unfortunately it doesn't even compare to GPT 5.4 in that sense.
I'm not being a hater, I love Opus for different reasons, but I can't rely on it for its technical ability.
Amazing to see the top Claude models still way above all other models for C++ & Java, while GPT 5.5 is higher in Python & JS and others. Shows the skew in the training data sets, and maybe the go-to-market focus, with Anthropic focusing on enterprise customers much more than OpenAI?
Matches with my experience with Opus for C++.
C# results are empty - @gertlabs - any ETA for those?
C# testing is a new feature added a few days ago from HN comment suggestions, samples will continue growing. Most C# data is currently for non-agentic workloads: https://gertlabs.com/?mode=oneshot_coding
It's a surprising result, and a lot of it stems from the Pro variant struggling with our custom harness in agentic tasks (whereas Flash does fine), as well as provider instability. Failed requests are not counted against the model in its score, but it's possible there are additional silent degradations even on successful requests.
Either that, or Flash is truly a better architecture and the Pro variant is heavily benchmaxxed. It wouldn't be the first time we saw something like that in our benchmarking. We collect samples every week so it'll be interesting to see if it rebalances over time as new providers host the model. Flash is great though; it's so fast and cheap.
Our philosophy is that you can design problems so that they can scale through a few release cycles by making environments more complex, with no known ceiling. The key for scalability is not having a single correct answer (even though Victor's benchmark is interesting), but still being objectively scorable.
That's what we've done with our comprehensive reasoning and coding benchmark at https://gertlabs.com
Self organizing systems is an area of research to which I think LLMs will contribute immensely.
But as of now, even newer AI models are not particularly insightful. I'm always surprised by how suboptimal near-frontier LLMs are at collaborating in some of the easier cooperative environments on my benchmarking and RL platform. For example, check out a replay of consensus grid here: https://gertlabs.com/spectate
While interesting, it's not clear to me just from looking at consensus grid how they are prompted.
Do you tell them to think and coordinate the next step through some type of sync/talking mechanism, or is it turn by turn?
I suspect turn by turn, since it's similar to other experiments, and in that case it wouldn't work because they wouldn't have a set amount of time to think about the next step together?
All of our environments are tick based (with ticks of varying speeds), and this is explained in the prompt given to the models, along with the latest observation and a history of recent events/conversations/actions.
So that does make the game more challenging, versus some other simulations we have where multiple conversation turns happen before action. But the inefficiencies I'm describing are different; for example, an agent reaches part of the destination area but is clearly blocking another player who needs to pass, and most models will just stay put instead of moving along to another target spot.
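To make that concrete, here's a rough sketch of what assembling a per-tick prompt can look like; the field names and layout are illustrative, not our exact format:

```python
# Illustrative only -- the actual prompt on gertlabs.com may differ.
def build_tick_prompt(rules: str, tick: int, observation: str,
                      history: list[str], max_history: int = 10) -> str:
    recent = history[-max_history:]  # recent events/conversations/actions
    return "\n".join([
        "Game rules (including the tick cadence and available actions):",
        rules,
        f"Tick {tick}. Latest observation:",
        observation,
        "Recent events:",
        *recent,
        "Respond with your next action (and, optionally, a chat message).",
    ])
```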
So is "Game Overview" the prompt? Because i can't seem to see any indication / hint given to the models that its a game they should work together on and commmunicate etc.
The agent makes a copy of itself in /tmp/. Runs. Evaluates. Updates itself. Makes a copy of itself. Runs. Evaluates. Updates itself. Makes a ...... you get the idea.
It will not stop if the recursion is given a hard-to-meet termination condition. Also, if it can cheat to satisfy the termination condition, it will.
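In rough Python, the loop looks something like this (the evaluate/update steps are placeholders for whatever the agent actually does):

```python
import shutil
import subprocess
import sys

TARGET_SCORE = 1.0  # a hard-to-meet termination condition

def evaluate(path: str) -> float:
    """Placeholder scorer; a real harness would run tests, check output, etc."""
    return 0.0

def update_agent(path: str) -> None:
    """Placeholder; here the agent rewrites its own source before the next run."""

src = "agent.py"
for generation in range(100):                 # cap runaway recursion
    clone = f"/tmp/agent_gen{generation}.py"
    shutil.copy(src, clone)                   # makes a copy of itself
    subprocess.run([sys.executable, clone])   # runs
    if evaluate(clone) >= TARGET_SCORE:       # evaluates
        break
    update_agent(clone)                       # updates itself, then repeats
    src = clone
```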
Early takeaway from this release: DeepSeek V4 Flash is the model to pay attention to here. It's cheap, effective, and REALLY fast.
The Pro model is slow, not much better at coding reasoning so far when it works, and honestly too unreliable and rate-limited to be of much use currently. Hopefully that improves as new providers host the model. Flash is working fine and is currently performing competitively with recent releases, but only on agentic workflows. Check back in 24 hours for full combined scoring with tool use and long context for both models.
Many of the frontier Chinese AI labs have released near-frontier models that are just a little bit behind Opus 4.6 in terms of speed, tool use ability, or long context handling. Open weights are winning the AI race, led by China. Crazy couple weeks of releases.
Mimo V2.5 Pro by Xiaomi (not open weights) is actually the best performer of the latest string of Chinese releases in our combined, comprehensive benchmarks, despite getting less attention. Kimi K2.6 is the most interesting open weights release, still. DeepSeek is not the leader in the space anymore.
An interesting pattern with the latest string of Chinese releases is the much better agentic boost (the models are not as smart out of the box, but their ability to iterate in a loop with tools makes up most of the difference). DeepSeek V4 Flash exemplifies this: not a smart model on the first try, but it makes up for it over the course of a session.
I would say all benchmarks are inherently subjective. How is yours better? It seems to produce somewhat strange results: Opus 4.6 being worse than 4.5, for example, or Chinese models being rated too high. Kimi, DeepSeek, and GLM are all great in the open source world, but I don't believe they are ahead of SOTA models from Anthropic, OpenAI, or Google.
No, some benchmarks are definitely objective, but most can be easily gamed. For example, most of the benchmarks on the model cards: they have measurable answers that don't rely on a human judge (a human made the question, but the answers measure some uncontroversial knowledge or capability). But because there is a single correct answer, and those answers leak (or are randomly discovered and optimized for in training), they lose value over time, and regardless, they have a ceiling on the intelligence they can measure.
Others are purely subjective, like LMArena, which really only measures the personality and style preferences of the masses at this point, because frontier LLM technical answers are too hard for the average person to judge.
Then there are some interesting one-off benchmarks, but they lack enough rigor, breadth, and samples to draw larger conclusions from.
So we designed our benchmark with 3 goals: objective measurements (individual submissions not dependent on a human or LLM judge), no known correct answer (so simulations can scale to much higher levels of intelligence), and enough variety over important aspects of intelligence. We do this by running multiple models in cooperative/competitive environments with very complex action spaces and objective scoring, where model performance is relative and affected by the actions of other participants.
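As one illustration of relative scoring, an Elo-style update turns pairwise match outcomes into a ladder. This shows the general idea of relative, objectively scored rankings, not necessarily our exact aggregation:

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 16.0):
    """score_a is 1.0 for a win, 0.5 for a draw, 0.0 for a loss."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    return (r_a + k * (score_a - expected_a),
            r_b + k * ((1.0 - score_a) - (1.0 - expected_a)))

# An upset win by the lower-rated model moves both ratings noticeably:
print(elo_update(1500.0, 1600.0, 1.0))  # -> (~1510.24, ~1589.76)
```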
And yeah, there are some interesting results when you have a more objective benchmark. It should raise eyebrows when every single sub-release of every company's model is better across the board than its predecessor -- that isn't reality.
I agree that benchmarks are inherently subjective.
But the fact that you cite your belief as your main argument is funny: you don't even have any inherently subjective numbers to justify what you believe, you only have "I don't believe".
Sure, I mixed two things together. I don't think this benchmark is bad; I just did not like that it is presented as the ultimate objective truth. The other thing I mentioned is that it delivers different results from other benchmarks, so the "belief" stems from those other benchmarks.
You are arguing from your belief instead of an objective truth. The benchmark is more objective; if you don't agree with it, come up with a better one. But what you believe doesn't matter.
It was not a confrontational take. But all benchmarks are designed by humans, and we are not that great at measuring intelligence, so it is somewhat subjective. I was just arguing with the word "objective", not with the results per se.
Only if the benchmark is private and done properly on relevant tasks, which is rarely the case. I can guarantee that you have a ton of blind spots if you look at it through the lens of a ranking ladder in some generic tasks.
I'm particularly interested in it being REALLY fast - do you have any rough tok/s numbers for the flash model? I'm excited for unsloth to drop some quants that I can try and run locally, but really curious how it's been performing speed wise. In general I actually over-index on speed over intelligence. I'd rather a model make mistakes quickly and correct in a follow-up than take forever to get a slightly better initial result.
Take a look at the Time column in https://gertlabs.com/?mode=oneshot_coding -- this is the total time to complete a solution for a reasonably complex problem end-to-end (you would have to divide avg submission size by it to estimate tok/s). It's fast in the sense that most of the smart, recent Chinese releases are quite slow, especially the DeepSeek Pro variant. Opus 4.7 is also quite fast.
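Back-of-envelope, with made-up numbers:

```python
# Rough tok/s estimate from the Time column; both values are hypothetical.
total_time_s = 120.0          # "Time": end-to-end solution time
avg_submission_tokens = 6000  # average submission size
print(avg_submission_tokens / total_time_s, "tok/s")  # -> 50.0 tok/s
```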
If pure speed is most important for your use case, GPT-5.3 Chat is the fastest model we've tested and it's still reasonably smart. Not meant for agentic tool usage / long context, though.
So it might be more useful for business applications or non-engineering usage where you don't need exceptional intelligence, but it's useful to get fast, cheap responses.
I see the value in that, but there are a few reasons that isn't on the immediate roadmap -- mainly, it shifts focus from measuring the model to measuring the harness. The agentic benchmark section you see on the site is comparable to how an agent would perform using an open harness like Pi. But latest tool-using models are pretty well adapted to any harness, so I think that's less of a factor in overall model performance.
We've been working on a way to address the obvious problems with existing benchmarks, by creating a single comprehensive benchmark that measures things that technical people care about, while also getting as close to an objective, "core intelligence" measurement as possible.
Some demo games shown on /spectate give you an idea of how we test models and why this would be difficult to benchmax. I think our benchmark is by far the best relative measurement of artificial intelligence out there. Feedback is welcome and usually acted upon quickly.
Early benchmarks show tremendous improvement over Kimi K2 Thinking, which didn't perform well on our benchmarks (and we do use best available quantization).
Kimi K2.6 is currently the top open weights model in one-shot coding reasoning, a little better than GLM 5.1, and still a strong contender against SOTA models from ~3 months ago (comparable to Gemini 3.1 Pro Preview).
Agentic tests are still running; check back tomorrow. Open weights models typically struggle with longer contexts in agentic workflows, but GLM 5.1 still handled them very well, so I'm curious how Kimi ends up. Both the old Kimi and the new model are on the slower side, which probably makes them less usable for agentic coding work regardless. The old Kimi K2 model was severely benchmaxxed and was only really interesting for generating more variation and temperature, not for solving hard problems. The new one is a much stronger generalist.
Overall, the field of open weights models is looking fantastic. A new near-frontier release every week, it seems.
Cool website. I don't understand enough about the various benchmarks or how they're done to judge whether anything is accurate, but I love the layout and features, especially the spectator feature, which is pretty cool. One thing: I saw the "Market simulator" spectator feature but didn't see a corresponding benchmark for it. Is it "Finance", "Betting", or "Trading"?
In terms of raw token cost, I've seen a couple providers at (all prices per Mtok) $0.95 input / $0.15 cache input / $5 output, vs $3 input / $15 output for Sonnet.
Task prices, of course, will be more interesting: a dumber model may use more tokens to get to the same goal.
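A quick worked example with the prices above; the token counts are hypothetical:

```python
# Per-task cost at per-Mtok prices. Even if the cheaper model burns 2x
# the tokens, it can still come out ahead here -- but the gap narrows.
def task_cost(in_tok: int, out_tok: int, in_price: float, out_price: float) -> float:
    return (in_tok * in_price + out_tok * out_price) / 1e6

cheap = task_cost(400_000, 100_000, 0.95, 5.0)  # ~$0.88 at 2x the tokens
sonnet = task_cost(200_000, 50_000, 3.0, 15.0)  # ~$1.35
print(f"cheap: ${cheap:.2f} vs sonnet: ${sonnet:.2f}")
```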
I'm looking at your table now - is there a reason why you don't include cost? If Opus 4.7 is the winner but costs e.g. 5x as much, that's important information.
We recently added cost (last week), so data is sparse. Check back in a few weeks and it will be represented somewhere on the homepage, probably in the Efficiency Chart at the bottom. We also plan to show model performance deviation over time after we collect more data.
I'm interested to hear about any other data representations you'd like to see, too. The goal is to convey the most important information as densely as possible, without too much clutter.
Not sure what you mean. Time series chart of model performance over time to see if proprietary models get degraded? That's in the works, but we will need a couple months more data collection before launch.
The idea is that the larger a coding task is and the longer the coding agent runs, the higher the chance that the agent won't follow the rules and guidelines.
We will as soon as API access is widely available. Once a model goes live, we typically have one-shot reasoning benchmarks up in ~8 hours and comprehensive agentic/combined benchmarks up after 24-48 hours. We're working on building relationships with each lab to have the results before launch.
It's interesting; I can only speculate as to the underlying reason. When given enough time, models outperform in Rust/C++ in longer agentic tasks, and actually perform worst in Python, at least for tasks that aren't judged on code speed. https://gertlabs.com/?mode=agentic_coding
It makes sense when you consider that LLMs don't generalize very well, so they're heavily dependent on how good (how varied as well as how high-quality) the training data is.
Well it might explain why pro-Claude vs pro-Codex people keep talking past each other on this forum. I see people all the time assuming that anybody who likes Codex must be some sort of bot because of their own biases, but I work almost exclusively in Rust and find Codex extremely competent (and a much better overall engineer), don't trust Claude/Opus at all... but I see in this bench it scores lower on TypeScript etc. than Opus does.
Good question. We missed that release entirely. Our automated model checker only went live 2 months ago so they were manually curated prior to that. I'm adding it now. It'll be live in ~12 hours.
Can you add C# to the supported languages? It's widely used, and it'd be helpful for people and companies to see how different models fare against each other.