This would subsidize truck and private car transport against rail, which is counterproductive if you are trying to lower the long term costs of transport and decrease transport externalities (e.g. fine particle pollution, noise, climate change).
I think my sweet spot is shipping one big (30min+) feature a day, and then spending synchronous time iterating on it to fix edge cases or tweak things.
The rest of my time goes to prepping those big features (designing, speccing, talking, thinking, walking).
Going to see how big a feature can be before the quality suffers too much and it becomes unmaintainable. This depends heavily on how well I spec it out and how well I orchestrate the agentic workflow.
I've gone through a bunch of different processes learning how to use Claude.
Giving it large tasks that take 40 minutes basically always fails for me. Giving it small tasks that take 30s to a minute feels like it is my typist and not a worker. I find that I am happiest and most effective at the 5 to 7 minute cycle timeframe.
The need for "complex tasks" should be exceptional enough that you're not building your workflow around them. A good example of such an exception would be kickstarting a port of a project for which you have a great test suite from one language to another. This is rare in most professional settings.
I wholeheartedly disagree with this. For any iteration, Claude should be reading your codebase, reading hundreds of thousands of tokens of (anonymized) production data, asking itself questions about backwards compatibility that go beyond existing test suites, running scripts and CI to test that backwards compatibility, and running a full-stack dev server and Chrome instance to QA that change, across multiple real-world examples.
And if you're building a feature that will call AI at runtime, you'll be iterating on multiple versions of a prompt that will be used at runtime, each of which adds token generation to each round of this.
In practice on anything other than a greenfield project, if you're asking for meaningful features in complex systems, you'll be at that 10 minute mark or more. But you've also meaningfully reduced time-to-review, because it's doing all that QA, and can provide executive summaries of what it finds. So multitasking actually works.
Computers are fast. If a physics engine can simulate a game world in 1/60 of a second, the majority of tasks should be doable in less than 7 minutes.
Whenever I see a transcript of a long-running task, I see a lot of agent drift: lacking context (or working in a disorganized codebase), it tries various ways to gather information, then settles on the wrong info and produces bad results.
Greppability of the codebase helps. So does following patterns and using good naming. A quick overview of the codebase and a description of its conventions also shorten the reflection steps. Adding helper tools (scripts) helps too.
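As one sketch of what such a helper script might look like (the file names and layout here are hypothetical), a tiny "codebase overview" tool that prints each Python file with the names it defines, so the agent doesn't burn turns exploring:

```python
# Hypothetical "overview" helper an agent can run before editing:
# lists each .py file and the top-level classes/functions it defines.
import ast
import pathlib
import tempfile

def overview(root):
    lines = []
    for path in sorted(pathlib.Path(root).rglob("*.py")):
        tree = ast.parse(path.read_text())
        names = [n.name for n in tree.body
                 if isinstance(n, (ast.FunctionDef, ast.ClassDef))]
        lines.append(f"{path.name}: {', '.join(names)}")
    return "\n".join(lines)

# Demo on a throwaway directory with one made-up file:
d = tempfile.mkdtemp()
pathlib.Path(d, "billing.py").write_text(
    "class Invoice: pass\ndef total(i): ...\n")
print(overview(d))  # → billing.py: Invoice, total
```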
Some hypotheses are simply badly posed and will not lead to fruitful scientific investigation.
For example, it may be my hypothesis that modern computers work through the agency of evil demons. You could spend a lot of time discussing with me how this hypothesis could be put to the test empirically. But it may be that this is not a disagreement that scientific inquiry is likely to resolve.
So too, I think, with "intelligence" or "consciousness".
What people are actually concerned about is economic impact, and we can have a fruitful debate over economic impact without discussing "intelligence" or "consciousness".
I happen to think it is also useful to discuss "intelligence" and "consciousness", but nevertheless think these things are unconnected to the economic impact.
I'm shocked to see how poorly these models, which I find useful day to day, do in solving virtually any of the problems in Unlambda.
Before looking at the results my guess was that scores would be higher for Unlambda than any of the others, because humans that learn Scheme don't find it all that hard to learn about the lambda calculus and combinatory logic.
But the model that did the best, Qwen-235B, got virtually every problem wrong.
This surprises me too. I've experimented with using LLMs to convert lambda calculus expressions into combinatory logic. There is a simple deterministic way to do this, and LLMs claim to know it, and then they confidently fail.
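The deterministic translation in question is bracket abstraction. A minimal Python sketch (the term encoding and function names are my own) of the standard S/K/I rules, with the usual K-optimization for variables that don't occur in the body:

```python
# Bracket abstraction: eliminate lambdas in favor of S, K, I.
# Terms are tuples: ('var', name), ('lam', name, body), ('app', f, x).

def occurs(x, t):
    """Does variable x occur free in term t?"""
    if t[0] == 'var':
        return t[1] == x
    if t[0] == 'lam':
        return t[1] != x and occurs(x, t[2])
    return occurs(x, t[1]) or occurs(x, t[2])

def abstract(x, t):
    """Compute [x]t, a lambda-free term behaving like λx.t.
    Assumes t itself is already lambda-free."""
    if t == ('var', x):
        return ('var', 'I')                      # [x]x = I
    if not occurs(x, t):
        return ('app', ('var', 'K'), t)          # [x]t = K t
    return ('app',                               # [x](f a) = S [x]f [x]a
            ('app', ('var', 'S'), abstract(x, t[1])),
            abstract(x, t[2]))

def translate(t):
    """Eliminate all lambdas from t, bottom-up."""
    if t[0] == 'var':
        return t
    if t[0] == 'app':
        return ('app', translate(t[1]), translate(t[2]))
    return abstract(t[1], translate(t[2]))

def show(t):
    """Unlambda-style printing: backtick is application."""
    if t[0] == 'var':
        return t[1]
    return '`' + show(t[1]) + show(t[2])

identity = ('lam', 'x', ('var', 'x'))
print(show(translate(identity)))  # → I
```

Each step is a purely mechanical rewrite, which makes the confident failures all the more striking.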
Probably because there's a ton of code that deals with nested parentheses across languages in the training data, and models have learned how to work around tokenization limitations when it comes to parentheses.
This is how I would deal with the problem if I maintained node: "Please, use your tokens and experimental energies to port to Rust and pass the following test suite. Let us know when you've got something that works."
Not only is it pushing production down, but the resulting high prices are almost certainly going to cause permanently lower demand in certain sectors and countries ("demand destruction").
I would love to see a complete accounting in a year or so.
Not necessarily. It's going to massively drive up demand for coal and wood where there used to be (comparatively) less-polluting gas. We won't really know until 6-12 months have passed and we've collected the data.
That can't be the whole story, right? Because there are an arbitrarily large number of (e.g.) Rust programs that will implement any given spec given in terms of unit tests, types, and perhaps some performance benchmarks.
But even accounting for all these "hard" constraints and metrics, there are clearly reasons to prefer some possible programs over others even when they all satisfy the same constraints and perform equally on all relevant metrics.
We do treat programs as efficient causes[1] of side effects in computing systems: a file is written, a block of memory is updated, etc. and the program is the cause of this.
But we also treat them as statements of a theory of the problem being solved[2]. And this latter treatment is often more important socially and economically. It is irrational to be indifferent to the theory of the problem the program expresses.
> there are clearly reasons to prefer some possible programs over others even when they all satisfy the same constraints
Maintainability is a big one missing from the current LLM/agentic workflow.
When business needs change, you need to be able to add on to the existing program.
We create feedback loops via tests to ensure programs behave according to the spec, but little to nothing in the way of code quality or maintainability.
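A toy illustration of the point (both functions and their "spec" are invented for the example): two implementations that satisfy the exact same tests, yet only one states its theory of the problem clearly. The feedback loop can't tell them apart; a maintainer can.

```python
# Two medians that pass identical tests but differ as "statements of
# a theory": the first names its steps, the second is correct but opaque.

def median_clear(xs):
    """Median: middle of the sorted values (mean of the two middles
    when the count is even)."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

def median_opaque(xs):
    s = sorted(xs)
    return (s[(len(s) - 1) // 2] + s[len(s) // 2]) / 2

# The same "spec" accepts both:
for f in (median_clear, median_opaque):
    assert f([3, 1, 2]) == 2
    assert f([4, 1, 3, 2]) == 2.5
```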
Looks like a great implementation. I want to question the basic user story, which seems to be: "I am a software developer who wants to improve productivity by running multiple simultaneous agents that are roughly isomorphic to a human software developer team."
I am burning a lot of tokens every day at work and on personal projects. It's helpful. I generally work in tmux with github copilot in one pane, and a few other terminal panes showing tests and current diff.
I find it really important to avoid the temptation to multi-task by running multiple agents. For quite varied tasks, productivity gains from multi-tasking have proven to be illusory. Why would it be different with writing software?