Hacker News | JamesSwift's comments

  a9284923-141a-434a-bfbb-52de7329861d
  d48d5a68-82cd-4988-b95c-c8c034003cd0
  5c236e02-16ea-42b1-b935-3a6a768e3655
  22e09356-08ce-4b2c-a8fd-596d818b1e8a
  4cb894f7-c3ed-4b8d-86c6-0242200ea333
Amusingly (not really), this is me trying to resume sessions to get feedback IDs. It was an absolute chore to get it to give me the commands to resume these conversations, and it kept messing things up: cf764035-0a1d-4c3f-811d-d70e5b1feeef

Thanks for the feedback IDs — read all 5 transcripts.

On the model behavior: your sessions were sending effort=high on every request (confirmed in telemetry), so this isn't the effort default. The data points at adaptive thinking under-allocating reasoning on certain turns — the specific turns where it fabricated (Stripe API version, Git SHA suffix, apt package list) had zero reasoning emitted, while the turns with deep reasoning were correct. We're investigating with the model team. Interim workaround: CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 forces a fixed reasoning budget instead of letting the model decide per-turn.


Hey bcherny, I'm confused as to what's happening here. The linked issue was closed, with you seeming to imply there's no actual problem and that people are just misunderstanding the hidden reasoning summaries and the change to the default effort level.

But here you seem to be saying there is a bug, with adaptive reasoning under-allocating. Is this a separate issue from the linked one? If not, wouldn't it help to respond to the linked issue acknowledging a model issue and telling people to disable adaptive reasoning for now? Not everyone is going to be reading comments on HN.


It's better PR to close issues and tell users they're holding it wrong, and meanwhile quietly fix the issue in the background. Also possibly safer for legal reasons.

There's a five-hour difference between the replies, and new data came in, so the posts aren't really in conflict.

Also, it doesn't sound like they know there's a model issue, so reopening now would be premature. Maybe they just read it wrong; better to let a few others verify first, then reopen.


Love this. Responding to users. Detailed info on the investigation. Action being taken (at least it seems so).

And all hidden in the comments of a niche forum, while the actual issue is closed and whitewashed? You got played.

Surely you realize it's AI responding? (not sure if /s)

I cannot provide the session IDs, but I have tried the above flag and can confirm it makes a huge difference. You should treat this as a bug and make this the default behavior. Clearly the adaptive thinking is making the model plain stupid and useless. It's time you guys took this seriously and stopped messing with the performance on every damn release.

Just set that flag and am already getting similarly poor results. New one: 93b9f545-716c-4335-b216-bf0c758dff7c

And another where Claude gets into a long cycle of "wait, that's not right... hold on... actually..." correcting itself in its train of thought. It found the answer eventually but wasted a lot of cycles getting there (reporting because this is a regression in my experience vs. a couple weeks ago): 28e1a9a2-b88c-4a8d-880f-92db0e46ffe8

Another: 1395b7d6-f2f1-4e24-a815-73852bcdeed2

It fails to answer my initial question and instead tells me what I need to do to check. Then it hallucinates an answer without researching anything and comes to an inaccurate conclusion; only when I prompt it further does it finally reach a (maybe) correct answer.

I haven't submitted a few more, but I think it's safe to say that disabling adaptive thinking isn't the answer here.


This kind of thing is harder for regular end-users to understand after the change that removed reasoning details.

I am curious: are you able to see our session text based on the session ID? That was a big no at some of the tier-1 places I worked; no employee could see user texts.

IIRC for Enterprise, using /feedback or /bug is an exception to the "we promise not to use your data" agreement.

My guess is there isn't enough hardware, so Anthropic is trying to limit how much soup the buffet serves. Did I guess right? And I would absolutely bet that enterprise accounts with millions in spend get priority, while retail users will be the first to get throttled.

> The data points at adaptive thinking under-allocating reasoning on certain turns

Will you reopen the issue you incorrectly closed, then…? Or are you just playacting concern?


[flagged]


Have you set effort to high or max?

Even with high effort, the adaptive thinking can just choose no thinking. See bcherny's post they were replying to: https://news.ycombinator.com/item?id=47668520

Yeah, I know, but you can disable it, as we saw.

I commented on the GH issue, but I've had effort set to 'high' for however long it's been available and have seen a marked decline since... checks notes... about 23 March, according to Slack messages I sent to the team to see if I was alone (I wasn't).

EDIT: actually, the first glaring issue I remember was on 20 March, when it hallucinated a full SHA from a short SHA while updating my GitHub Actions version pinning. That follows a pattern of it making really egregious assumptions about things without first validating or checking. I've also had it answer with hallucinated information instead of looking online first (to a higher degree than I've been used to after using these models daily for the past ~6 months).


It hallucinated a GUID for me instead of using the one in the RFC for WebSockets. The fun part was that the beginning was the same. Then it hardcoded the unit tests to be green with the wrong GUID.

Hallucinated GUIDs are a class of failure that prompt instructions will never reliably prevent. The fix that worked for me: regex checks run on every file the agent produces, before anything executes.
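A minimal sketch of that kind of check in Python; the allowlist and function name are illustrative (not from the original comment), with the RFC 6455 WebSocket handshake GUID as the known-good example:

```python
import re

# Hypothetical allowlist of GUIDs we know are correct, e.g. the
# WebSocket handshake GUID defined in RFC 6455.
KNOWN_GUIDS = {"258EAFA5-E914-47DA-95CA-C5AB0DC85B11"}

# Matches the standard 8-4-4-4-12 hex GUID layout.
GUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)

def find_suspect_guids(text: str) -> list[str]:
    """Return GUID-shaped strings that are not in the allowlist."""
    return [g for g in GUID_RE.findall(text) if g.upper() not in KNOWN_GUIDS]
```

Running this over every agent-written file before tests execute catches the "first half looks right" fabrications that slip past a human skim.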

Well, I've never had the issue before, and I've hit that or similar issues every few days over the past couple weeks.

Opus 4.6 was definitely a mixed bag for me. Overall I'd probably prefer 4.5, but only just barely, and I stay on 4.6 just for its "default" nature. But if 4.5 is unchanged vs. what I've had on 4.6 lately, then I'd 100% move back to it. I'll have to test that.

Same. I keep using 4.6 to get "used to it", but I find it wanting semi-regularly.

Exact same timeline as me and my team. It's been maddening. I've been a big believer in AI since late last year, but only because the models got so good. This puts us dangerously close to before that threshold was crossed, so now I'm having to do _way_ more work than before.

Multiple people on our team have independently noticed a _significant_ drop in quality and intelligence in Opus 4.6 over the past few weeks: glaring hallucinations, nonsensical reasoning, and ignoring data from the context immediately preceding it. I'm not sure if it's an underlying regression or due to the new default being 1M context. But it's been _incredibly_ frustrating, and I'm screaming obscenities at it multiple times a week now vs. maybe once a month.

Well, technically Bun doesn't _prevent_ hooks; it just requires opting into them. And even that includes a default set of pre-whitelisted packages. A much better system, but not perfect.

And actually, just looking this up, it appears claude-code itself was just added to that whitelist :D

https://github.com/oven-sh/bun/commit/5c59842f78880a8b5d9c2e...


Yes, for me it was 2 or 3 rounds of "it's just not clicking" before it did, and then there was no looking back. I've heard the same anecdote from lots of others as well.


The sibling comment does a good job of going into flakes, but to answer this:

> Do you install a package or a service

A package is the raw software installation. So e.g. the bash package is just a wrapper around building bash from source (theoretically from source, at least... you can also define a package as a binary distribution as long as you specify the content hash). The service (actually the _module_) is for everything else around the software installation. So, e.g., the bash _package_ builds bash and makes it available to put on your path, but the bash _module_ might also configure your .bashrc and set bash up as your user's shell. It would also generally refer to the bash _package_, so you get all that plumbing but can still specify the particular version of bash you want to use.

Another common example: a plex package would again just build the Plex software, but a plex module would perhaps create a plex user, set up systemd units, open your firewall, and create a media directory.
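The Plex example above, sketched as a NixOS configuration fragment (option names follow the real `services.plex` module in nixpkgs, but treat the details as illustrative):

```nix
{ config, pkgs, ... }:
{
  # The *module*: enabling it creates a plex user, a systemd unit,
  # a state directory, and (optionally) firewall rules.
  services.plex = {
    enable = true;
    openFirewall = true;           # open Plex's port for you
    dataDir = "/var/lib/plex";     # state/media directory
  };

  # vs. the *package* alone: just puts the software on your PATH,
  # with none of the plumbing above.
  environment.systemPackages = [ pkgs.plex ];
}
```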

EDIT: the next layer of confusion is that modules (which are sort of the secret sauce of Nix, so naturally you'll want to use them a bunch) are implementation details of several separate subsystems. Meaning, NixOS and home-manager and nix-darwin all have "modules", but they are not compatible: each has its own idea of what a module is, and Nix itself doesn't provide this natively. That means things get a little more complicated/involved when you use those ecosystems together. It's not too bad, but it is annoying.


Not only is it composable, it's generalizable. So yes, there are also Chef, Ansible, apt, uv, nodeenv, etc... or there's just Nix. It's able to be the "one tool" to rule them all, often with better reproducibility guarantees.


Just a note that if you are on NixOS, you can configure things to run in an FHS-compatible wrapper (https://ryantm.github.io/nixpkgs/builders/special/fhs-enviro...)
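A rough sketch of what that looks like with nixpkgs' `buildFHSEnv` (the app name, dependency list, and script are hypothetical placeholders):

```nix
# Wrap a prebuilt binary that expects a conventional /usr, /lib, etc.
# layout inside a simulated FHS environment.
pkgs.buildFHSEnv {
  name = "myapp-fhs";
  # Libraries the binary expects to find at FHS paths (placeholders).
  targetPkgs = pkgs: [ pkgs.zlib pkgs.openssl ];
  # Command to run inside the environment.
  runScript = "./myapp";
}
```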



