powera's comments

I'm seeing very low quality results on LM Studio with this model. Worse than Gemma 3 12B.

It is getting questions like "David has 18 apples and Ivan has 7 apples. How many apples do they have together?" wrong half the time, while Gemma 3 12B answered them very consistently. Other smoke tests (like Chinese translation, and the infamous "Rs in Strawberry" test) also show poor results.

I don't know if it is a quantization/release issue, if the parameters needed for accurate responses have changed (i.e. it needs "thinking" tokens to handle its base error rate), or if the model has been so focused on audio/video that the text processing is bad.


I just ran this 25 times via ollama, and every time I got 25.
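
For anyone who wants to repeat this kind of check, here is a minimal sketch of a repeated smoke test. `ask` is a hypothetical stand-in for whatever function sends a prompt to your model and returns the reply text; it is not the API of LM Studio or ollama, and `fake_model` exists only so the sketch runs on its own.

```python
import re
from typing import Callable, Optional

PROMPT = ("David has 18 apples and Ivan has 7 apples. "
          "How many apples do they have together?")

def extract_last_int(text: str) -> Optional[int]:
    """Return the last integer in a reply, or None if there is none."""
    matches = re.findall(r"\d+", text)
    return int(matches[-1]) if matches else None

def pass_rate(ask: Callable[[str], str], prompt: str,
              expected: int, runs: int = 25) -> float:
    """Ask the same question `runs` times and return the fraction answered correctly."""
    correct = sum(extract_last_int(ask(prompt)) == expected for _ in range(runs))
    return correct / runs

def fake_model(prompt: str) -> str:
    # Hypothetical stand-in; replace with a real call to your local model.
    return "Together they have 18 + 7 = 25 apples."

print(pass_rate(fake_model, PROMPT, expected=25))
```

Taking the last number in the reply is a crude heuristic, so a model that shows its work and then restates an operand would be scored wrong. That may be good enough for a smoke test, but not for a real eval.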

He (or ChatGPT) is throwing spaghetti at the wall. Not having the standard API key be able to delete the database (and backups) in one call makes sense. "Wanting a human to type DELETE as part of a delete API call" does not.


In the user interface for Railway, all destructive actions require multiple confirmations, plus typing "apply destructive changes". Why would an API key (regardless of its scope) be able to delete without confirmation?


> Why would an API key (regardless of its scope) be able to delete without confirmation?

What do you think an API is for? There's no user sitting at the keyboard when an API is called so where would that confirmation come from? It can't come from the user because there is no user.


Isn’t the point of an API to have two computers talk to each other? As in “if I want safeguards for humans, it would be my responsibility to put them BEFORE calling that API”?


> Why would an API key (regardless of its scope) be able to delete without confirmation?

How do you see this working? Any confirmation would be given by the agent.


... because that's how every other cloud provider API works? The AWS console makes you confirm before deleting a bucket; DeleteBucket does not.
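
A rough sketch of that split, with the confirmation living in the caller's own tooling rather than in the API. Every name here (`guarded_delete`, `ConfirmationRequired`, the `delete_call` parameter) is hypothetical and not part of any provider's SDK; `delete_call` stands in for the real, unconfirmed API call.

```python
from typing import Callable

class ConfirmationRequired(Exception):
    """Raised when the typed confirmation phrase does not match."""

def guarded_delete(resource_name: str,
                   typed_confirmation: str,
                   delete_call: Callable[[str], None]) -> None:
    """Run delete_call only if a human typed the exact confirmation phrase.

    The check happens before the API is called; the API itself stays a
    plain machine-to-machine call with no prompt.
    """
    expected = f"DELETE {resource_name}"
    if typed_confirmation != expected:
        raise ConfirmationRequired(f"type {expected!r} to confirm")
    delete_call(resource_name)
```

A UI or CLI would collect `typed_confirmation` from a person. An agent holding the same key could of course supply the phrase itself, which is why this belongs in front of the API and is not a substitute for scoping the key.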


I'm not sure they've found/understand it yet. My two main theories:

1. A bunch of people who started Claude Code codebases in December are now working with larger codebases, which means more context. Claude reads a lot of code files, and doesn't effectively prune from the context as far as I can tell. I find myself having to hint Claude regularly about what files to read (and not read) to avoid having 75k of unrelated files in the context window.

2. Claude Code tries to do more now, for the benefit of people who don't know exactly what they want. The trade-off is that it's worse at doing exactly what people want, when they do know. The "small fix" becomes a large endeavor for Claude.



that's the url i submitted, but HN changed it. no idea why.

it hasn't been posted before, and i thought it was interesting.

based on the comments i hope the authors read them, because it looks like they are getting some good feedback here.


```
<!-- SEO/Feeds -->
<link rel="canonical" href="https://blogfontawesome.wpcomstaging.com/we-have-a-99-email-...">
```

Misconfigured website.


Kind of makes you wonder, just a little, about the quality of their email setup, too


Click-bait title; the article goes on to say "Open source isn't actually broken" as long as you buy their product.


Wikinews never worked; the principle of "verifiability" that Wikipedia was based on simply doesn't work for news-collection, which requires trusted first-party accounts.

The project was also already dead; the English Wikinews has had 10 "articles" posted in the last 3 weeks, two of which were trivial sports stories (a second-division Queensland football match, and the retirement of a pitcher whose last substantial year in MLB was 2019). The most recent story is that an amateur jazz group recently played at a library.

It will no longer be an attractive nuisance to the few who stumble across it. Rest in peace.


(January 2025)


I've been waiting for this update.

For many "simple" LLM tasks, GPT-5-mini was sufficient 99% of the time. Hopefully these models will do even more, and get closer to 100% accuracy.

The prices are up 2-4x compared to GPT-5-mini and nano. Were those models just loss leaders, or are these substantially larger/better?


So far on my (simple) benchmarks, GPT-5.4-mini is looking very good. GPT-5.4-mini is about 30% faster than GPT-5-mini. GPT-5.4-mini gets 80% on the "how many Rs in Strawberry" test, and nearly perfect scores on everything else I threw at it.

GPT-5.4-nano is less impressive. I would stick to GPT-5.4-mini where precise data is a requirement. But nano is fast, and probably cheaper and better quality than an 8-20B parameter local model would be.

(See https://encyclopedia.foundation/benchmarks/dashboard/ for details. The data is moderately blurry: some outlier (15s) calls are included, a few benchmark questions are ambiguous, and some prices shown are very rough estimates.)


For us, it was also pretty good, but the performance decreased recently, which forced us to migrate to haiku-4.5. More expensive but much more reliable (when Anthropic is up, of course).


They don't change the model weights (no frontier lab does). If your evals, prompts, and tool calls are all the same, I'm curious how you are concluding that performance decreased.


This looks like somebody re-releasing Qwen models to promote their own company. https://news.ycombinator.com/item?id=47217305 is the link to Qwen's repo.


If you want to have a chance at running a large model, it needs to be quantized. The unsloth user on Hugging Face manages popular quantizations for many models, Qwen included, and I think he developed dynamic GGUF quantization.

Take Qwen/Qwen3.5-35B-A3B for example. It's 72 GB, while unsloth/Qwen3.5-35B-A3B-GGUF has quantizations from 9 to 38 GB.
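
Back-of-the-envelope arithmetic for those sizes, as a sketch. It assumes roughly 35 billion parameters (read off the model name) and a uniform number of bits per weight. Real GGUF files mix bit widths across tensors and carry extra metadata, so actual sizes will differ.

```python
def weights_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

def implied_bits_per_weight(size_gb: float, params_billion: float) -> float:
    """Average bits per weight implied by a file size."""
    return size_gb * 8 / params_billion

PARAMS_BILLION = 35  # assumption taken from the "35B" in the model name

for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{weights_size_gb(PARAMS_BILLION, bits):.1f} GB")

for size in (9, 38):
    print(f"{size} GB: ~{implied_bits_per_weight(size, PARAMS_BILLION):.1f} bits/weight")
```

At 16 bits this gives about 70 GB, in the same ballpark as the 72 GB original, and the 9 to 38 GB range works out to roughly 2 to 9 bits per weight on average.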


Unsloth is one of the most well-known providers of model quantizations, if not the most well-known. The release post should of course reference the source, but most people probably use quantized models anyway. Unsloth and bartowski are my go-tos, so the link is relevant and convenient.


Between this and 4.6's tendency to do so much more "exploratory" work, I am back to using ChatGPT Codex for some tasks.

Two months ago, Claude was great for "here is a specific task I want you to do to this file". Today, they seem to be pivoting towards "I don't know how to code but want this feature" usage. Which might be a good product decision, but makes it worse as a substitute for writing the code myself.


Have you played with the effort setting? I'm finding medium effort on 4.6 to give more satisfactory results for that kind of thing.


I feel the exact same way. Trying to cater to the "no-code" crowd is blurring the product's focus. It seems they've stuffed the system prompt with "be creative and explore" instructions, which kills determinism, so now we have to burn tokens just to tell it: "Don't think, just write the code."


Same here. Both Claude Code, due to this change, and Opus 4.6, given how it is set up, act as if they can do things autonomously. But in my experience, they really can't. Letting it overthink something while being on the wrong track is what leads to AI slop.

