Hacker News: bethekind's comments

1 client, 1 agent? Interesting

Poolometer looks cool! I will say your smiley-face icons look a lil odd and AI-generated, but otherwise I love the graph tracking and suggestions

I agree. Gemini models are held back by the segmentation of usage across multiple products, combined with their awful harnesses and tooling. Gemini CLI, Antigravity, Gemini Code Assist, Jules... the list goes on. Each of these products has only a small limit, and they all share the same usage pool.

It gets worse than that, though. Most harnesses built to handle Codex and Claude cannot handle Gemini 3.1 correctly. Google has trained Gemini 3.1 to return different JSON keys than most harnesses expect, resulting in awful results and outright failures. (Based on my perusing multiple harness GitHub issues after Gemini 3.1 came out.)
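For what it's worth, this kind of mismatch can sometimes be papered over with a key-renaming shim sitting between the model output and the harness parser. A minimal sketch, where the key names (`functionCall`, `args`) are invented for illustration and are not Gemini's actual schema:

```python
# Hypothetical translation layer for mismatched tool-call JSON keys.
KEY_MAP = {"functionCall": "tool_call", "args": "arguments"}

def normalize(payload: dict) -> dict:
    """Rename model-specific keys to the names the harness expects."""
    return {KEY_MAP.get(k, k): normalize(v) if isinstance(v, dict) else v
            for k, v in payload.items()}
```

A real shim would also need to recurse into lists and handle schema differences beyond key names, but the idea is just a translation layer in front of the harness.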


They likely already have. You can use all caps and yell at Claude and it'll react normally, while doing so with ChatGPT scares it, resulting in timid answers

I think this is simply a result of what's in the Claude system prompt.

> If the person becomes abusive over the course of a conversation, Claude avoids becoming increasingly submissive in response.

See: https://platform.claude.com/docs/en/release-notes/system-pro...


This is inherently hard to avoid with a prompt. The model is instruction-tuned and trained to interpret anything sent under the user role as an instruction, and not necessarily in a straightforward manner. Even if you train it to refuse or dodge certain inputs (which they do), it's going to affect the model's responses, often in subtle ways, especially in a multi-turn conversation. Anthropic themselves call this character drift.

For me, GPT always seems to get stuck in a particular state where it responds with a single short sentence per paragraph and becomes weirdly philosophical. This eventually happens in every session. I wish I knew what triggers it, because it's annoying and completely undermines its usefulness.

Usually a session is delivered as context, up to the token limit, for inference to be performed on. Are you keeping each session to one subject? Have you made personalizations? Do you add lots of data?

It would be interesting if you posted a couple of sessions to see what 'philosophical' things it's arriving at and what precedes it.


I think my next steps are: 1) try out OpenAI's $20/month plan. I've heard they're much more generous. 2) try out OpenRouter free models. I don't need geniuses; as long as I can see the thinking (something Claude Code obfuscates by default) I should be good. I've heard good things about the CLIO harness and want to try OpenRouter + CLIO

Word on the street is that Opus is much much larger of a model than GPT-5.4 and that’s why the rate limits on Codex are so much more generous. But I guess you could also just switch to Sonnet or Haiku in Claude Code?

I'm taking a bet on local models to do the non genius work. Gemma 4 (released yesterday) has been designed to run on laptops / edge devices....and so far is running pretty well for me.

How’s Gemma 4 been?

Edge models are good for their purpose, but putting them in an agentic flow with the current ollama quants on a Mac Mini, I see a high tool-use error rate and output hallucination.

For JSON-to-text formatting it works well on a one-round basis. So realistically you should have an evaluation ready to go before you rely on these models. I currently judge outputs myself, but people often use a smarter LLM as the judge.

Today, writing an eval harness with Claude is a 5-minute job. Do it yourself so you can keep exploring as the quants for Gemma get better.
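A minimal version of such a harness really can be tiny. A sketch for the JSON-to-text task, where `call_model` and `call_judge` are stubs you'd replace with real calls (a local Gemma endpoint and an LLM-as-judge, respectively):

```python
import json

def call_model(prompt: str) -> str:
    # Stub: replace with a request to your local Gemma endpoint.
    record = json.loads(prompt)
    return f"{record['name']} is {record['age']} years old."

def call_judge(prompt: str, answer: str) -> bool:
    # Stub: replace with an LLM-as-judge call; here, a simple keyword check.
    record = json.loads(prompt)
    return record["name"] in answer and str(record["age"]) in answer

def run_eval(cases: list[str]) -> float:
    """Return the pass rate of the model over a list of JSON prompts."""
    passed = sum(call_judge(c, call_model(c)) for c in cases)
    return passed / len(cases)

cases = [json.dumps({"name": "Ada", "age": 36}),
         json.dumps({"name": "Alan", "age": 41})]
print(run_eval(cases))
```

Re-running the same `cases` after each new quant drop gives you a consistent pass rate to compare against.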


OpenAI has the better coding model anyway. You will be pleasantly surprised by Codex. The TUI tool is less buggy and runs faster, and it's a more careful, less error-prone model. It's not as "creative", but it's more intelligent.

On top of that their $20 plan has much higher usage limits than Anthropic's $20 plan and they allow its use in e.g. opencode. So you can set up opencode to use both OpenAI's codex plan plus one of the more intelligent Chinese models so you can maximize your usage. Have it fully plan things out using GPT 5.4, write code using e.g. Qwen 3.6, then switch back to GPT 5.4 for review
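The plan/code/review split can be sketched as a simple pipeline; `complete` below is a stub standing in for an OpenAI-compatible chat call, and the model names just echo the ones mentioned above:

```python
def complete(model: str, prompt: str) -> str:
    # Stub: swap in a real OpenAI-compatible client pointed at your provider.
    return f"[{model}] {prompt[:40]}"

def pipeline(task: str) -> str:
    """Plan with the stronger model, implement cheaply, review with the stronger one."""
    plan = complete("gpt-5.4", f"Plan the change: {task}")
    code = complete("qwen-3.6", f"Write code for this plan: {plan}")
    review = complete("gpt-5.4", f"Review this code: {code}")
    return review
```

In a tool like opencode this routing is done through its own config rather than hand-rolled code, but the shape of the workflow is the same.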


Openrouter free models have 50 requests per day limit + data collection. As per their doc.

You can put $10 on the account and get unlimited requests. I abused this last week with Nemotron Super to test out some stuff and made probably over 10,000 requests over a couple of days and didn't get blocked or anything, except 5xx errors and slowdowns tho.

i tried out gpt 5.4 xhigh and it did meaningfully worse with the same prompt as opus 4.6. like, obvious mistakes

I've been pretty satisfied using oh-my-openagent (omo) on opencode with both opus-4.6 and gpt-5.4 lately. The author of omo suggests different prompting strategies for different models and goes into some detail here. https://github.com/code-yeongyu/oh-my-openagent/blob/dev/doc... For each agent they define, they change the prompt depending on which model is being used to fit it. I wonder how much of the "x did worse than y for the same prompt" tests could be improved if the prompts were actually tailored to what the model is good at. I also wonder if any of this matters or if it's all a crock of bologna..
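The per-model tailoring idea reduces to something like a template lookup per model; the templates below are invented for illustration, not omo's actual prompts:

```python
# Made-up per-model prompt variants, in the spirit of omo's per-agent prompts.
PROMPTS = {
    "opus-4.6": "You are a careful engineer. Think step by step.\n{task}",
    "gpt-5.4": "Be terse. Output only the final diff.\n{task}",
}

def build_prompt(model: str, task: str) -> str:
    """Pick the model-specific template, falling back to the bare task."""
    template = PROMPTS.get(model, "{task}")
    return template.format(task=task)
```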

i think it may matter a good bit. i definitely have to write in different styles with different models (and catch myself doing so unintentionally) now that you mention it...

definitely not bologna, at least anecdotally :)


Fwiw, I run this eval every week on a set of known prompts, and I believe the in-group differences are bigger than the out-group ones.

That is I get more variance between opus 4.6 and itself than I do between the sota models.

I don’t have the budget for statistical significance, but I’m convinced people claiming broad differences are just vibing, or there are times when agent features make a big difference.
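The in-group vs. out-group comparison boils down to comparing run-to-run spread against the gap between model means. A toy sketch with made-up scores:

```python
# Compare within-model (run-to-run) spread against the between-model gap.
# The scores below are invented to illustrate the computation.
from statistics import mean, stdev

runs = {  # model -> per-run average eval scores on the same fixed prompts
    "opus-4.6": [0.81, 0.74, 0.88, 0.79],
    "gpt-5.4":  [0.83, 0.77, 0.85, 0.80],
}

within = mean(stdev(scores) for scores in runs.values())
means = [mean(scores) for scores in runs.values()]
between = max(means) - min(means)
print(f"within-model spread {within:.3f} vs between-model gap {between:.3f}")
```

With numbers like these the run-to-run noise dwarfs the mean gap, which is the pattern being described: you need many repeats before a between-model difference means anything.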


it may be the agent features in my case. now that i think about it, i also forgot that my CLAUDE.md is different from my AGENTS.md

either way, all that one can really rely on is the benchmarks, and those are easily cheated/overfitted to.

i think it's all very hard to quantify, so take my previous comment with a massive rock of salt


This. If I run 4 Claude Code opus agents with subagents, my 8GB of RAM just dies.

I know they can do better


I've used Gemini and now Claude. Both were meh until I found the Superpowers skill. Will be trying ChatGPT next month.

You can "feel" the LLM being limited with Gemini, less so with Claude. Hopefully even less so with ChatGPT


I used it to speed up a codecompass-like repo from 86 files per second to 2000. I still haven't used the repo in production, so maybe it secretly broke things, but the ability to say "optimize this benchmark and commit only if you pass these tests" is nice
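That guardrail is easy to replicate outside the agent too: gate the commit on the test suite's exit code. A sketch (pytest and git here are example commands; the runner is injectable so you can swap in a dry run):

```python
import subprocess

def commit_if_green(message: str, run=subprocess.run) -> bool:
    """Run the test suite; commit all changes only on a clean pass."""
    if run(["pytest", "-q"]).returncode != 0:
        return False  # tests failed: leave the working tree uncommitted
    run(["git", "add", "-A"], check=True)
    run(["git", "commit", "-m", message], check=True)
    return True
```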


I love the phrase "inference hunger"


Jane Street had a cool video about how you can address a lack of training data in a programming language using LLM patching. The video is called "Arjun Guha: How Language Models Model Programming Languages & How Programmers Model Language Models"

The big takeaway is that you can "patch" LLMs and steer them to correct answers in less-trained programming languages, allowing for superior performance. Might work here. Not a clue how to implement it, but stuff like llm-to-doc and the like makes me hopeful

