
powera · 2026-06-03T19:05:25 1780513525

I'm seeing very low quality results on LMStudio with this model. Worse than Gemma 3 12B.

It is getting questions like "David has 18 apples and Ivan has 7 apples. How many apples do they have together?" wrong half the time, while Gemma3 12B could very consistently answer that. Other smoke tests (like Chinese translation, and the infamous "Rs in Strawberry" test) also show poor results.

I don't know if it is a quantization/release issue, if the parameters needed for accurate responses have changed (i.e. it needs "thinking" tokens to handle its base error rate), or if the model has been so focused on audio/video that the text processing is bad.

szmarczak · 2026-06-06T12:17:35 1780748255

I just ran this 25 times and every time I got 25. Ran this via ollama.
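For anyone who wants to compare pass rates rather than impressions, here is a minimal sketch of this kind of repeated smoke test. The `ask` callable is a placeholder I made up, not an LMStudio or ollama API: plug in whatever function sends a prompt to your model and returns the reply text. Taking the last integer in the reply as "the answer" is also just a heuristic.

```python
import re
from typing import Callable, Optional

PROMPT = ("David has 18 apples and Ivan has 7 apples. "
          "How many apples do they have together?")
EXPECTED = 18 + 7  # the correct answer, 25


def last_integer(reply: str) -> Optional[int]:
    """Return the last integer that appears in the reply, or None if there is none."""
    numbers = re.findall(r"-?\d+", reply)
    return int(numbers[-1]) if numbers else None


def pass_rate(ask: Callable[[str], str], runs: int = 25) -> float:
    """Send PROMPT `runs` times through `ask` and return the fraction answered correctly."""
    correct = sum(last_integer(ask(PROMPT)) == EXPECTED for _ in range(runs))
    return correct / runs
```

Running the same harness against both backends with the same sampling settings would at least show whether the gap comes from the model or from the setup.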

powera · 2026-04-26T17:16:36 1777223796

He (or ChatGPT) is throwing spaghetti at the wall. Not having the standard API key be able to delete the database (and backups) in one call makes sense. "Wanting a human to type DELETE as part of a delete API call" does not.

jeremyccrane · 2026-04-26T20:08:18 1777234098

In the user interface for Railway, all destructive actions require multiple confirmations, plus typing "apply destructive changes". Why would an API key (regardless of its scope) be able to delete without confirmation?

lelanthran · 2026-04-26T20:49:40 1777236580

> Why would an API key (regardless of its scope) be able to delete without confirmation?

What do you think an API is for? There's no user sitting at the keyboard when an API is called so where would that confirmation come from? It can't come from the user because there is no user.

fetzu · 2026-04-26T20:19:10 1777234750

Isn’t the point of an API to have two computers talk to each other? As in “if I want safeguards for humans, it would be my responsibility to put them BEFORE calling that API”?

lelanthran · 2026-04-26T20:16:21 1777234581

> Why would an API key (regardless of its scope) be able to delete without confirmation?

How do you see this working? Any confirmation would be given by the agent.

jbxntuehineoh · 2026-04-26T20:41:31 1777236091

... because that's how every other cloud provider API works? the AWS console makes you confirm before deleting a bucket; DeleteBucket does not
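A rough sketch of what "the console confirms, the API does not" looks like when the safeguard lives on the caller's side. Everything here is hypothetical: `guarded_delete`, `ConfirmationError`, and the `delete` callable stand in for whatever client you actually use, and none of it is Railway's or AWS's API.

```python
from typing import Callable


class ConfirmationError(Exception):
    """Raised when the typed confirmation does not match the resource name."""


def guarded_delete(
    resource: str,
    delete: Callable[[str], None],
    read_confirmation: Callable[[str], str] = input,
) -> None:
    """Call `delete(resource)` only after the resource name has been retyped."""
    typed = read_confirmation(f"Type '{resource}' to delete it: ")
    if typed.strip() != resource:
        raise ConfirmationError(f"confirmation mismatch, {resource!r} not deleted")
    # The underlying API call itself stays unconditional, as in the thread.
    delete(resource)
```

This only helps when a human is the one typing. If an agent holds the key, it can supply the confirmation string itself, which is why limiting what the key is allowed to do matters more than the prompt.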

powera · 2026-04-24T15:28:47 1777044527

I'm not sure they've found/understand it yet. My two main theories:

1. A bunch of people with new Claude Code codebases in December now are working with a larger codebase, causing more context. Claude reads a lot of code files, and doesn't effectively prune from the context as far as I can tell. I find myself having to hint Claude regularly about what files to read (and not read) to avoid having 75k of unrelated files in the context window.

2. Claude Code tries to do more now, for the benefit of people who don't know exactly what they want. The trade-off is that it's worse at doing exactly what people want, when they do know. The "small fix" becomes a large endeavor for Claude.

powera · 2026-04-12T13:58:34 1776002314

From March, also https://blog.fontawesome.com/we-have-a-99-email-reputation-g... is the canonical URL.

em-bee · 2026-04-12T14:01:12 1776002472

that's the url i submitted, but HN changed it. no idea why.

it hasn't been posted before, and i thought it was interesting.

based on the comments i hope the authors read them, because it looks like they are getting some good feedback here.

fontain · 2026-04-12T14:07:42 1776002862

```
<link rel="canonical" href="https://blogfontawesome.wpcomstaging.com/we-have-a-99-email-...">
```

Misconfigured website.
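If you want to check what a page declares as its canonical URL, a small standard-library sketch like this should do. Whether HN rewrites submitted URLs from this tag is an inference from this thread, not something I can confirm, and `find_canonical` is just a name picked for the example.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # rel may hold several space-separated tokens, or be missing entirely.
        rels = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rels and self.canonical is None:
            self.canonical = attr.get("href")


def find_canonical(html: str) -> Optional[str]:
    """Return the canonical URL declared in the HTML, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```

Feeding it the page source and comparing the result with the URL in the address bar is enough to spot a staging hostname leaking into production.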

orion7 · 2026-04-13T03:28:58 1776050938

Kind of makes you wonder, just a little, about the quality of their email setup, too

powera · 2026-04-09T22:20:43 1775773243

Click-bait title; the article goes on to say "Open source isn't actually broken" as long as you buy their product.

powera · 2026-03-31T13:01:26 1774962086

Wikinews never worked; the principle of "verifiability" that Wikipedia was based on simply doesn't work for news-collection, which requires trusted first-party accounts.

The project was also already dead; the English Wikinews has had 10 "articles" posted in the last 3 weeks, two of which were trivial sports stories (a second-division Queensland football match, and the retirement of a pitcher whose last substantial year in MLB was 2019). The most recent story is that an amateur jazz group recently played at a library.

It will no longer be an attractive nuisance to the few who stumble across it. Rest in peace.

powera · 2026-03-23T14:11:34 1774275094

(January 2025)

powera · 2026-03-17T17:30:11 1773768611

I've been waiting for this update.

For many "simple" LLM tasks, GPT-5-mini was sufficient 99% of the time. Hopefully these models will do even more, with accuracy closer to 100%.

The prices are up 2-4x compared to GPT-5-mini and nano. Were those models just loss leaders, or are these substantially larger/better?

powera · 2026-03-17T19:28:10 1773775690

So far on my (simple) benchmarks, GPT-5.4-mini is looking very good. GPT-5.4-mini is about 30% faster than GPT-5-mini. GPT-5.4-mini gets 80% on the "how many Rs in Strawberry" test, and nearly perfect scores on everything else I threw at it.

GPT-5.4-nano is less impressive. I would stick to gpt-5.4-mini where precise data is a requirement. But it is fast, and probably cheaper and better quality than an 8-20B parameter local model would be.

( https://encyclopedia.foundation/benchmarks/dashboard/ for details - the data is moderately blurry - some outlier (15s) calls are included, a few benchmark questions are ambiguous, and some prices shown are very rough estimates ).

HugoDias · 2026-03-17T17:37:08 1773769028

For us, it was also pretty good, but performance decreased recently, which forced us to migrate to haiku-4.5. More expensive but much more reliable (when Anthropic is up, of course).

throwaway911282 · 2026-03-17T17:43:45 1773769425

they don't change the model weights (no frontier lab does). if your evals, prompts, and tool calls are all the same, I'm curious how you can tell that performance decreased.

powera · 2026-03-02T15:36:55 1772465815

This looks like somebody re-releasing QWEN models to promote their own company. https://news.ycombinator.com/item?id=47217305 is the link to QWEN's repo.

cpburns2009 · 2026-03-02T15:50:51 1772466651

If you want to have a chance at running a large model, it needs to be quantized. The unsloth user on Huggingface manages popular quantizations for many models, Qwen included, and I think he developed dynamic GGUF quantization.

Take Qwen/Qwen3.5-35B-A3B for example. It's 72 GB. While unsloth/Qwen3.5-35B-A3B-GGUF has quantizations from 9-38 GB.
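The arithmetic behind those sizes can be sketched as below. It assumes roughly 35 billion parameters (read off the model name) and 16-bit weights for the unquantized release, both of which are my assumptions, and it ignores any file overhead.

```python
def weights_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10**9 bytes)."""
    return params_billions * bits_per_weight / 8


def implied_bits_per_weight(size_gb: float, params_billions: float) -> float:
    """Average bits per weight implied by a given file size."""
    return size_gb * 8 / params_billions


print(weights_size_gb(35, 16))                    # 70.0, same ballpark as the 72 GB release
print(round(implied_bits_per_weight(9, 35), 2))   # 2.06
print(round(implied_bits_per_weight(38, 35), 2))  # 8.69
```

So under those assumptions the 9-38 GB range corresponds to roughly 2 to 9 bits per weight on average, versus 16 for the original.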

karmakaze · 2026-03-02T21:23:57 1772486637

Unsloth is one of the most well-known providers of model quantizations, if not the most well-known. The release post of course should reference the source, but most people probably use quantized models anyway (unsloth and bartowski are my go-tos), so the link is relevant/convenient.

powera · 2026-02-16T14:09:01 1771250941

Between this and 4.6's tendency to do so much more "exploratory" work, I am back to using ChatGPT Codex for some tasks.

Two months ago, Claude was great for "here is a specific task I want you to do to this file". Today, they seem to be pivoting towards "I don't know how to code but want this feature" usage. Which might be a good product decision, but makes it worse as a substitute for writing the code myself.

slices · 2026-02-16T14:38:30 1771252710

Have you played with the effort setting? I'm finding medium effort on 4.6 to give more satisfactory results for that kind of thing.

KurSix · 2026-02-16T16:42:37 1771260157

I feel the exact same way. Trying to cater to the "no-code" crowd is blurring the product's focus. It seems they've stuffed the system prompt with "be creative and explore" instructions, which kills determinism - so now we have to burn tokens just to tell it: "Don't think, just write the code"

lukaslalinsky · 2026-02-17T07:55:57 1771314957

Same here. Both Claude Code, due to this change, and Opus 4.6, given how it is set up, act as if they can do things autonomously. But in my experience, they really can't. Letting it overthink something while being on the wrong track is what leads to AI slop.
