Hacker News | Leynos's comments

Opus 4.6 currently leads the remote labor index at 4.17. GPT-5.4 isn't measured on that one though: https://www.remotelabor.ai/

GPT-5.4 Pro leads FrontierMath Tier 4 at 35%: https://epoch.ai/benchmarks/frontiermath-tier-4/


It's not difficult at all to burn through your weekly limit just writing code.

I wondered whether you could suggest what might be stopping large-scale build-out of sodium-ion and redox flow batteries?

It feels to me that "walk down the design tree" has a specific meaning with respect to treating the design as a hierarchy (although whether that means BFS or DFS is still ambiguous). "Be critical" lacks that specificity.

Yes, but then it’s better to spell those instructions out explicitly, e.g. state facts, state ambiguities and assumptions, inspect the codebase, challenge assumptions, etc.

This particular skill is not great.


Use evals

Coming soon, unit, behavioural and regression tests for your prompts and skills :P


How do you use evals when you’re using Claude Code, given that Claude Code also changes their prompts all the time?

You’ll have:

* Claude model version

* Claude Code prompts and tools

* Your own prompts and skills and whatnot

* Your repository’s source code (= the input)

All of those change constantly, it’s not like it’s some kind of SWE benchmark.


You just said it. If consistency is that important, keep consistent versions of model, harness, prompts, skills, etc., and regression test changes. That way lies madness :)
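
A minimal sketch of what that could look like, in Python. The model name, harness version, prompt text and scores below are made-up placeholders, and `fingerprint` and `check_regression` are hypothetical helpers rather than part of any real eval tool. The idea is to hash everything that can change behaviour, store eval scores against that hash, and flag any case that drops by more than a tolerance.

```python
import hashlib
import json


def fingerprint(config: dict) -> str:
    """Stable short hash of everything that can change an agent's behaviour."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def check_regression(baseline: dict, current: dict, tolerance: float = 0.05) -> list[str]:
    """Return the eval cases whose score dropped by more than `tolerance`."""
    return sorted(
        name
        for name, score in baseline.items()
        if current.get(name, 0.0) < score - tolerance
    )


# Hypothetical pinned setup; every value here is a placeholder.
pinned = {
    "model": "example-model-2025-01-01",
    "harness": "example-harness 1.2.3",
    "prompt_sha": hashlib.sha256(b"You are a careful reviewer.").hexdigest(),
}

# Invented scores: the baseline was recorded under an earlier fingerprint.
baseline_scores = {"refactor_case": 0.90, "bugfix_case": 0.80}
current_scores = {"refactor_case": 0.91, "bugfix_case": 0.60}

regressions = check_regression(baseline_scores, current_scores)
print(fingerprint(pinned), regressions)
```

When the fingerprint changes (new model, new harness release, edited skill), rerun the evals and compare against the stored baseline before accepting the change.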

Kimi is surprisingly good at Rust.

Here's what I suggest:

Serious planning. The plans should include constraints, scope, escalation criteria, completion criteria, test and documentation plan.

Enforce single responsibility, CQRS, domain segregation, etc. Make the code as easy for you to reason about as possible. Enforce domain naming and function / variable naming conventions to make the code as easy to talk about as possible.

Use code review bots (Sourcery, CodeRabbit, and CodeScene). They catch the small things (violations of contract, antipatterns, etc.) and the large (UX concerns, architectural flaws, etc.).

Go all in on linting. Make the rules as strict as possible, and tell the review bots to call out rule subversions. Write your own lints for the things the review bots are complaining about regularly that aren't caught by lints.
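
As a rough illustration of a home-grown lint, here is a sketch in Python using the standard `ast` module. The rule (flagging bare `except:` clauses) and the names `BareExceptLint`, `lint_source` and `SAMPLE` are chosen just for this example; a real project would more likely plug a rule like this into its existing linter.

```python
import ast


class BareExceptLint(ast.NodeVisitor):
    """Collect line numbers of `except:` clauses that name no exception type."""

    def __init__(self) -> None:
        self.violations: list[int] = []

    def visit_ExceptHandler(self, node: ast.ExceptHandler) -> None:
        # A bare `except:` has no type expression attached to the handler.
        if node.type is None:
            self.violations.append(node.lineno)
        self.generic_visit(node)


def lint_source(source: str) -> list[int]:
    """Parse `source` and return the lines that break the rule."""
    checker = BareExceptLint()
    checker.visit(ast.parse(source))
    return checker.violations


# The sample is only parsed, never executed, so `risky` need not exist.
SAMPLE = """\
try:
    risky()
except:
    pass

try:
    risky()
except ValueError:
    pass
"""

print(lint_source(SAMPLE))
```

The same visitor pattern extends to whatever the review bots keep complaining about: add a `visit_*` method per rule and fail the build when `violations` is non-empty.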

Use BDD alongside unit tests: read the .feature files before the build and give feedback. Use property testing as part of your normal testing strategy, along with snapshot testing, E2E testing with MITM proxies, etc. For functions of any non-trivial complexity, consider bounded or unbounded proofs, model checking or undefined behaviour testing.
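
To make "property testing" concrete, here is a hand-rolled sketch in Python using only the standard library. The run-length encoder is an invented example, not code from any of the projects mentioned, and a dedicated library such as Hypothesis would add input shrinking and smarter generation. The point is that you assert properties over many random inputs instead of a handful of fixed cases.

```python
import random


def rle_encode(text: str) -> list[tuple[str, int]]:
    """Run-length encode a string into (character, count) pairs."""
    runs: list[tuple[str, int]] = []
    for ch in text:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def rle_decode(runs: list[tuple[str, int]]) -> str:
    """Expand (character, count) pairs back into a string."""
    return "".join(ch * count for ch, count in runs)


def check_round_trip(trials: int = 500, seed: int = 0) -> int:
    """Check two properties on random inputs; return the number of trials run."""
    rng = random.Random(seed)  # seeded so failures are reproducible
    for _ in range(trials):
        length = rng.randint(0, 30)
        text = "".join(rng.choice("ab") for _ in range(length))
        runs = rle_encode(text)
        # Property 1: decoding the encoding gives back the input.
        assert rle_decode(runs) == text
        # Property 2: adjacent runs never share a character.
        assert all(a[0] != b[0] for a, b in zip(runs, runs[1:]))
    return trials


print(check_round_trip())
```

A two-letter alphabet is deliberate here: it makes long runs common, which is where an encoder like this is most likely to go wrong.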

I'm looking into mutation testing and fuzzing too, but I am still learning.

Pause for frequent code audits. Ask an agent to audit for code duplication, redundancy, poor assumptions, architectural or domain violations, TOCTOU violations. Give yourself maintenance sprints where you pay down debt before resuming new features.
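
For anyone unfamiliar with TOCTOU, a small Python sketch of the pattern an audit would look for. The file name and the functions `create_racy` and `create_atomic` are invented for illustration. The first function checks and then acts in two steps; the second lets the operating system do both at once.

```python
import os
import tempfile


def create_racy(path: str, data: str) -> bool:
    """Check-then-act: another process can create `path` between the two steps."""
    if os.path.exists(path):          # time of check
        return False
    with open(path, "w") as fh:       # time of use
        fh.write(data)
    return True


def create_atomic(path: str, data: str) -> bool:
    """Let the OS perform the existence check and the create as one operation."""
    try:
        with open(path, "x") as fh:   # mode "x" fails if the file already exists
            fh.write(data)
        return True
    except FileExistsError:
        return False


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "lock.txt")
    first = create_atomic(target, "owner-1")
    second = create_atomic(target, "owner-2")
    with open(target) as fh:
        contents = fh.read()

print(first, second, contents)
```

In a single-threaded run both versions behave the same, which is exactly why this class of bug survives ordinary tests and is worth asking an agent to look for.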

The beauty of agentic coding is, suddenly you have time for all of this.


> Serious planning. The plans should include constraints, scope, escalation criteria, completion criteria, test and documentation plan.

I feel like I am a bit stupid for not being able to do this. My process is more iterative. I start working on a feature, then I discover some other function that's slightly related. I go and refactor it into common code, then proceed with the original task. Sometimes I stop midway to see if this can be done with a library somewhere and go look at an example. I take many detours like these. I am never working on a single task like a robot. I don't want Claude to work like that either. That seems so opposite to how my brain works.

What am I missing?


Again, here's what works for me.

When I get an idea for something I want to build, I will usually spend time talking to ChatGPT about it. I'll request deep research on existing implementations, relevant technologies and algorithms, and a survey of literature. I find NotebookLM helps a lot at this point, as does Elevenreader (I tend to listen to these reports while walking or doing the dishes or what have you). I feed all of those into ChatGPT Deep Research along with my own thoughts about the direction of the system, and ask it to produce a design document.

That gets me something like this:

https://github.com/leynos/spycatcher-harness/blob/main/docs/...

If I need further revisions, I'll ask Codex or Claude Code to do those.

Finally, I break that down into a roadmap of phases, steps and achievable tasks using a prompt that defines what I want from each of those.

That gets me this:

https://github.com/leynos/spycatcher-harness/blob/main/docs/...

Then I use an adapted version of OpenAI's execplans recipe to plan out each task (https://github.com/leynos/agent-helper-scripts/blob/main/ski...).

The task plans end up looking like this:

https://github.com/leynos/spycatcher-harness/blob/main/docs/...

At the moment, I use Opus or GPT-5.4 on high to generate those plans, and Sonnet or GPT-5.4 medium to implement.

The roadmap and the design are definitely not set in stone. Each step is a learning opportunity, and I'll often change the direction of the project based on what I learn during the planning and implementation. And of course, this is just what works for me. The fun of the last few months has been everyone finding out what works for them.


You seem to work a lot like how I do. If that is being stupid, then well, count me in too. To be honest, if I had to go through all the work of planning, scope, escalation criteria, etc., then I would probably be better off just writing the damn code myself at that point.


I see lots of posts like Stripe's minion where they just type a feature into a Slack chat and an agent goes and does it. That doesn't make any sense to me.


To be devil's advocate:

Many of those tools are overpowered unless you have a very complex project that many people depend on.

The AI tools will catch the most obvious issues, but will not help you with the most important aspects (e.g. whether your project is useful, or the UX is good).

In fact, having this complexity from the start may kneecap you (the "code is a liability" cliché).

You may be "shipping a lot of PRs" and "implementing solid engineering practices", but how do you know if that is getting closer to what you value?

How do you know that this is not actually slowing you down?


It depends a lot on what kind of company you are working at. In my work, the product concerns are taken care of by other people. I'm responsible for technical feasibility, alignment, and design, but not for deciding what features should be built or validating whether they are useful and add value; product people take care of that.

If you are solo or in a small company, you apply the complexity you need. You can even do it incrementally: when you see a pattern of issues repeating, address those over time, hardening the process from lessons learnt.

Ultimately the product discussion is separate from the engineering concerns on how to wrangle these tools, and they should meet in the middle so overbearing engineering practices don't kneecap what it is supposed to do: deliver value to the product.

I don't think there's a hard set of rules that can be applied broadly, the engineering job is to also find technical approaches that balance both needs, and adapt those when circumstances change.


On the one side I reject that product and engineering concerns are separated: Sometimes you want to avoid a feature due to the way it will limit you in the future, even if the AI can churn it in 2 minutes today.

On the other side perhaps your company, like most, does not know how to measure overengineering, cognitive complexity, lack of understanding, balancing speed/quality, morale, etc. but they surely suffer the effects of it.

I suspect that unless we get fully automated engineering / AGI soon, companies that value engineers with good taste will thrive, while those that double down into "ticket factory" mode will stagnate.


> On the one side I reject that product and engineering concerns are separated: Sometimes you want to avoid a feature due to the way it will limit you in the future, even if the AI can churn it in 2 minutes today.

That is exactly not what I meant, I'm sorry if it wasn't clear but your assumption about how my job works is absolutely wrong.

I even mention that the product discussion is separate only on "how to wrangle these tools":

> Ultimately the product discussion is separate from the engineering concerns on how to wrangle these tools, and they should meet in the middle so overbearing engineering practices don't kneecap what it is supposed to do: deliver value to the product.

Delivering value, which means also avoiding a feature that will limit or entrap you in the future.

> On the other side perhaps your company, like most, does not know how to measure overengineering, cognitive complexity, lack of understanding, balancing speed/quality, morale, etc. but they surely suffer the effects of it.

We do measure those and are quite strict about it, most of my design documents are about the trade-offs in all of those dimensions. We are very critical about proposals that don't consider future impacts over time, and mostly reject workarounds unless absolutely necessary (and those require a phase-out timeline for a more robust solution that will be accounted for as part of the initiative, so the cost of the technical debt is embedded from the get-go).

I believe I wasn't clear and/or you misunderstood what I said. I agree with you on all these points, and the company I work for is very much the opposite of a "ticket factory". Rejecting work because of concerns about its overall cross-boundary impact is very much praised and invited.

My comment was focused on how to wrangle these tools for engineering purposes being a separate discussion to the product/feature delivery, it's about tool usage in the most technical sense, which doesn't happen together with product.

We on the engineering side determine how best to apply these tools for the product we are tasked with delivering. Measuring the value delivered is outside of and orthogonal to the technical practices, since we already account for the trade-offs at proposal time, not development time. This measurement already existed pre-AI and is still what we use to validate whether a feature should be built or not, its impact and value delivered afterwards, and the cost of maintaining it vs value delivered. All of that includes the whole technical assessment, as we already did before.

Determining if a feature should be built or not is ultimately a pairing of engineering and product, taking into account everything you mentioned.

Determining the pipeline of potential future non-technical features at my job is not part of engineering, except for side-projects/hack ideas that have potential to be further developed as part of the product pipeline.


Sorry, I think you're right that I misinterpreted your comment. I still had in mind OP's example (BDD, mutational testing, all that jazz). I apologize!

Reading your comment, it looks like you work for a pretty nice company that takes those things seriously. I envy you!

My concern was that for companies unlike yours that don't have well-established engineering practices, it _feels_ that with AI you can go much faster, and in fact it's a great excuse to dismantle any remaining practices. But in reality they are either doing busywork or building the wrong thing. My guess is that those companies are going to learn that this is a bad idea in the future, when they already have a mess to deal with.

To put what I mean into perspective... if you browse OP's profile you can find absolutely gigantic PRs like https://github.com/leynos/weaver/pull/76. I can not review any PR like that in good faith, period.


Can't upvote you enough. This is the way. You aren't vibe coding slop; you have built an engineering process that works even if the tools aren't always reliable. This is the same way you build out a functioning and highly effective team of humans.

The only obvious bit you didn't cover was extensive documentation including historical records of various investigations, debug sessions and technical decisions.


Documentation is only useful if it is read. I have found it impossible to get many humans to read the documentation I write.


Building a fancy-looking process doesn't mean the output isn't slop. Vibecoders on Reddit have even more insane "engineering" processes. The parent comment has all of these:

Architecture & Design Principles • Single Responsibility Principle (SRP) • CQRS (Command Query Responsibility Segregation) • Domain Segregation • Domain-Driven Naming Conventions • Clear function/variable naming standards • Architectural constraint definition • Scope definition • Escalation criteria design • Completion criteria definition

Planning & Process • Formal upfront planning • Constraint-based design • Defined scope management • Escalation protocols • Completion criteria tracking • Maintenance sprints (technical debt paydown) • Frequent code audits

AI / Agentic Development Practices • Agent-assisted code audits • Agent-based feedback loops (e.g., reading .feature files pre-build) • Agent-driven reasoning optimization (code clarity for AI) • Continuous automated review cycles

Code Review & Static Analysis • Code review bots: Sourcery, CodeRabbit, CodeScene • Automated detection of: anti-patterns, contract violations, UX concerns, architectural flaws

Linting & Code Quality Enforcement • Strict linting rules • Custom lint rules • Enforcement of lint compliance via bots • Detection of lint rule subversion

Testing Strategies

Core Testing • Unit Testing • BDD (Behavior-Driven Development) • .feature file validation before build

Advanced Testing • Property-based testing • Snapshot testing • End-to-end (E2E) testing with MITM (man-in-the-middle) proxies

Formal / Heavyweight Testing • Model checking • Bounded proofs • Unbounded proofs • Undefined behavior testing

Emerging / Exploratory • Mutation testing • Fuzzing

Code Quality & Auditing • Code duplication detection • Redundancy analysis • Assumption validation • Architectural compliance checks • Domain boundary validation • TOCTOU (Time-of-check to time-of-use) vulnerability analysis

Development Workflow Enhancements • Continuous audit cycles • Debt-first maintenance phases • Feedback-driven iteration • Pre-build validation workflows

Security & Reliability Considerations • TOCTOU vulnerability detection • MITM-based E2E testing • Undefined behavior analysis • Fuzz testing (planned)


And here I am, just drawing diagrams on a whiteboard and designing UI in Balsamiq.


You are probably shipping, so that puts you ahead of most people still setting up their perfect process.


Okay, here goes. You can tell someone is acting in bad faith when they talk about a law that has been in force and enforced since the 1960s as if it were something new.

Of course, "touch grass" works just as well.


Laws can be in existence for decades before they are weaponized against people. It's illegal to have most eBay/Amazon bulbs on your car because they are not DOT approved. If someday they start impounding cars with light bars, fog lights, and LEDs as they cross state borders, targeting drivers of races they don't want in that state... Someone like you will say "you're just making stuff up, that law has been in effect since 1961."


And do you have statistical evidence to back up your claim of increased enforcement, or are you just reading about it in the Daily Mail?


Does it?


Yup


It's fun. That's the utility. Although you could also probably convey information in a visual shorthand in a way that is quicker to parse than a table.

I want one that shows my agents working in a hipster coffee shop with Chrono Trigger-style lighting.


An agent is a way of performing an action that will generate context or a useful side effect without having to worry about the intermediate context.

People already do this serially by having a model write a plan, clearing the context, then having the same or a cheaper model action the plan. Doing so discards the intermediate context.

Sub-agents just let you do this in parallel. This works best when you have a task that needs to be done multiple times that cannot be done deterministically. For example, applying the same helper class usage in multiple places across a codebase, finding something out about multiple parts of the codebase, or testing a hypothesis in multiple places across a codebase.
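
A toy sketch of that fan-out in Python. `run_subagent` is a hypothetical stand-in for a real model call and the module names are invented. What it shows is the shape: each worker builds up its own scratch context, and only the short summary comes back to the caller.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(task: str) -> str:
    """Hypothetical stand-in for a model call.

    A real sub-agent would accumulate a long intermediate context here;
    only the short summary string is handed back to the caller.
    """
    scratch = [f"step {i} for {task}" for i in range(3)]  # discarded on return
    return f"{task}: done in {len(scratch)} steps"


def fan_out(tasks: list[str], max_workers: int = 4) -> list[str]:
    """Run the same kind of task in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_subagent, tasks))


# Invented module names, standing in for places in a codebase.
modules = ["auth", "billing", "search"]
summaries = fan_out([f"apply helper in {m}" for m in modules])
for line in summaries:
    print(line)
```

The orchestrating context only ever sees `summaries`, which is the whole benefit: three pieces of work done, three lines of context spent.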

