This would subsidize truck and private car transport against rail, which is counterproductive if you are trying to lower the long term costs of transport and decrease transport externalities (e.g. fine particle pollution, noise, climate change).
I think my sweet spot is shipping one big (30min+) feature a day, and then spending synchronous time iterating on it to fix edge cases or tweak things.
The rest of my time goes to prepping those big features (designing, speccing, talking, thinking, walking).
Going to see how big a feature can be before the quality suffers too much and it becomes unmaintainable. This depends heavily on how well I spec it out and how well I orchestrate the agentic workflow.
I've gone through a bunch of different processes learning how to use Claude.
Giving it large tasks that take 40 minutes basically always fails for me. Giving it small tasks that take 30s to a minute feels like it is my typist and not a worker. I find that I am happiest and most effective at the 5 to 7 minute cycle timeframe.
The need for "complex tasks" should be exceptional enough that you're not building your workflow around them. A good example of such an exception would be kickstarting a port of a project for which you have a great test suite from one language to another. This is rare in most professional settings.
I wholeheartedly disagree with this. For any iteration, Claude should be reading your codebase, reading hundreds of thousands of tokens of (anonymized) production data, asking itself questions about backwards compatibility that go beyond existing test suites, running scripts and CI to test that backwards compatibility, and running a full-stack dev server and Chrome instance to QA that change, across multiple real-world examples.
And if you're building a feature that will call AI at runtime, you'll be iterating on multiple versions of a prompt that will be used at runtime, each of which adds token generation to each round of this.
In practice on anything other than a greenfield project, if you're asking for meaningful features in complex systems, you'll be at that 10 minute mark or more. But you've also meaningfully reduced time-to-review, because it's doing all that QA, and can provide executive summaries of what it finds. So multitasking actually works.
Computers are fast. If a physics engine can simulate a game world in 1/60 of a second, the majority of tasks should be doable in less than 7 minutes.
Whenever I see a transcript of a long-running task, I see a lot of agent drift: lacking context (or working in a disorganized codebase), it tries various ways to gather information, then settles on the wrong info and produces bad results.
Greppability of the codebase helps. So does following patterns and using good naming. A quick overview of the codebase and a description of its conventions also shorten the reflection steps. Adding helper tools (scripts) helps too.
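As one sketch of what such a helper script might look like (the file names and layout here are hypothetical), a tiny "codebase overview" tool that prints each Python file with the names it defines, so the agent doesn't burn turns exploring:

```python
# Hypothetical "overview" helper an agent can run before editing:
# lists each .py file and the top-level classes/functions it defines.
import ast
import pathlib
import tempfile

def overview(root):
    lines = []
    for path in sorted(pathlib.Path(root).rglob("*.py")):
        tree = ast.parse(path.read_text())
        names = [n.name for n in tree.body
                 if isinstance(n, (ast.FunctionDef, ast.ClassDef))]
        lines.append(f"{path.name}: {', '.join(names)}")
    return "\n".join(lines)

# Demo on a throwaway directory with one made-up file:
d = tempfile.mkdtemp()
pathlib.Path(d, "billing.py").write_text(
    "class Invoice: pass\ndef total(i): ...\n")
print(overview(d))  # → billing.py: Invoice, total
```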
Some hypotheses are simply badly posed and will not lead to fruitful scientific investigation.
For example, it may be my hypothesis that modern computers work through the agency of evil demons. You could spend a lot of time discussing with me how this hypothesis could be put to the test empirically. But it may be that this is not a disagreement that scientific inquiry is likely to resolve.
So too, I think, with "intelligence" or "consciousness".
What people are actually concerned about is economic impact, and we can have a fruitful debate over economic impact without discussing "intelligence" or "consciousness".
I happen to think it is also useful to discuss "intelligence" and "consciousness", but nevertheless think these things are unconnected to the economic impact.
I'm shocked to see how poorly these models, which I find useful day to day, do in solving virtually any of the problems in Unlambda.
Before looking at the results my guess was that scores would be higher for Unlambda than any of the others, because humans that learn Scheme don't find it all that hard to learn about the lambda calculus and combinatory logic.
But the model that did the best, Qwen-235B, got virtually every problem wrong.
This surprises me too. I've experimented with using LLMs to convert lambda calculus expressions into combinatory logic. There is a simple deterministic way to do this, and LLMs claim to know it, and then they confidently fail.
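The deterministic translation in question is bracket abstraction. A minimal Python sketch (the term encoding and function names are my own) of the standard S/K/I rules, with the usual K-optimization for variables that don't occur in the body:

```python
# Bracket abstraction: eliminate lambdas in favor of S, K, I.
# Terms are tuples: ('var', name), ('lam', name, body), ('app', f, x).

def occurs(x, t):
    """Does variable x occur free in term t?"""
    if t[0] == 'var':
        return t[1] == x
    if t[0] == 'lam':
        return t[1] != x and occurs(x, t[2])
    return occurs(x, t[1]) or occurs(x, t[2])

def abstract(x, t):
    """Compute [x]t, a lambda-free term behaving like λx.t.
    Assumes t itself is already lambda-free."""
    if t == ('var', x):
        return ('var', 'I')                      # [x]x = I
    if not occurs(x, t):
        return ('app', ('var', 'K'), t)          # [x]t = K t
    return ('app',                               # [x](f a) = S [x]f [x]a
            ('app', ('var', 'S'), abstract(x, t[1])),
            abstract(x, t[2]))

def translate(t):
    """Eliminate all lambdas from t, bottom-up."""
    if t[0] == 'var':
        return t
    if t[0] == 'app':
        return ('app', translate(t[1]), translate(t[2]))
    return abstract(t[1], translate(t[2]))

def show(t):
    """Unlambda-style printing: backtick is application."""
    if t[0] == 'var':
        return t[1]
    return '`' + show(t[1]) + show(t[2])

identity = ('lam', 'x', ('var', 'x'))
print(show(translate(identity)))  # → I
```

Each step is a purely mechanical rewrite, which makes the confident failures all the more striking.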
Probably because there's a ton of code that deals with nested parentheses across languages in the training data, and models have learned how to work around tokenization limitations when it comes to parentheses.
This is how I would deal with the problem if I maintained node: "Please, use your tokens and experimental energies to port to Rust and pass the following test suite. Let us know when you've got something that works."
Not only is it pushing production down, but the resulting high prices are almost certainly going to cause permanently lower demand in certain sectors and countries ("demand destruction").
I would love to see a complete accounting in a year or so.
Not necessarily. It's going to massively drive up demand for coal and wood where there used to be (comparatively) less-polluting gas. We won't really know until 6-12 months have passed and we've collected the data.
That can't be the whole story, right? Because there are an arbitrarily large number of (e.g.) Rust programs that will implement any given spec given in terms of unit tests, types, and perhaps some performance benchmarks.
But even accounting for all these "hard" constraints and metrics, there are clearly reasons to prefer some possible programs over others even when they all satisfy the same constraints and perform equally on all relevant metrics.
We do treat programs as efficient causes[1] of side effects in computing systems: a file is written, a block of memory is updated, etc. and the program is the cause of this.
But we also treat them as statements of a theory of the problem being solved[2]. And this latter treatment is often more important socially and economically. It is irrational to be indifferent to the theory of the problem the program expresses.
> there are clearly reasons to prefer some possible programs over others even when they all satisfy the same constraints
Maintainability is a big one missing from the current LLM/agentic workflow.
When business needs change, you need to be able to add on to the existing program.
We create feedback loops via tests to ensure programs behave according to the spec, but little to nothing in the way of code quality or maintainability.
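A toy illustration of the point (both functions and their "spec" are invented for the example): two implementations that satisfy the exact same tests, yet only one states its theory of the problem clearly. The feedback loop can't tell them apart; a maintainer can.

```python
# Two medians that pass identical tests but differ as "statements of
# a theory": the first names its steps, the second is correct but opaque.

def median_clear(xs):
    """Median: middle of the sorted values (mean of the two middles
    when the count is even)."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

def median_opaque(xs):
    s = sorted(xs)
    return (s[(len(s) - 1) // 2] + s[len(s) // 2]) / 2

# The same "spec" accepts both:
for f in (median_clear, median_opaque):
    assert f([3, 1, 2]) == 2
    assert f([4, 1, 3, 2]) == 2.5
```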
Looks like a great implementation. I want to question the basic user story, which seems to be: "I am a software developer who wants to improve productivity by running multiple simultaneous agents that are roughly isomorphic to a human software developer team."
I am burning a lot of tokens every day at work and on personal projects. It's helpful. I generally work in tmux with github copilot in one pane, and a few other terminal panes showing tests and current diff.
I find it really important to avoid the temptation to multi-task by running multiple agents. For quite varied tasks, productivity gains from multi-tasking have proven to be illusory. Why would it be different with writing software?