Hacker News | ColinEberhardt's comments

Oh, so _that_ is what a sub-agent is. I have been wondering about that for a while now!


Likewise. I have a nasty feeling that most AI agent deployments happen with nothing more than some cursory manual testing. Going with the ‘vibes’ (to borrow an overused term in the industry).


I can confirm this after hundreds of talks about the topic over the last 2 years. 90% of cases are simply not high-volume or high-stakes enough for the devs to care. I'm a founder of an evaluation automation startup, and our challenge is spotting teams right as their usage starts to grow and quality issues are about to escalate. Since that’s tough, we're trying to make getting to first evals so simple that teams can start building the mental models before things get out of hand.


A lot of "generative" work is like this. While you can come up with benchmarks galore, at the end of the day how a model "feels" only seems to come out from actual usage. Just read /r/localllama for opinions on which models are "benchmaxed" as they put it. It seems to be common knowledge in the local LLM community that many models perform well on benchmarks but that doesn't always reflect how good they actually are.

In my case I was until recently working on TTS and this was a huge barrier for us. We used all the common signal quality and MOS-simulation models that judged so-called "naturalness", "expressiveness", etc. But we found that none of these really helped us much in deciding when one model was better than another, or when a model was "good enough" for release. Our internal evaluations correlated poorly with them, and we even disagreed quite a bit within the team on the quality of output. This made hyperparameter tuning as well as commercial planning extremely difficult and we suffered greatly for it. (Notice my use of past tense here..)

Having good metrics is just really key and I'm now at the point where I'd go as far as to say that if good metrics don't exist it's almost not even worth working on something. (Almost.)
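To make the "correlated poorly" point concrete, here is a minimal sketch of one way to check whether an automatic metric agrees with human ratings: a Spearman rank correlation, written in plain Python. The function names and the idea of pairing one metric score with one human score per sample are assumptions for illustration, not the setup the comment describes.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(metric_scores, human_scores):
    """Pearson correlation of the two rank vectors (Spearman's rho)."""
    rx, ry = average_ranks(metric_scores), average_ranks(human_scores)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    # Raises ZeroDivisionError if either input is constant.
    return cov / (var_x * var_y) ** 0.5
```

A value near 1 would suggest the metric orders samples the way listeners do; a value near 0 would suggest it carries little information about perceived quality.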


> We find testing and evals to be the hardest problem here …

I wonder what this means for the agents that people are deploying into production? Are they tested at all? Or just manual ad-hoc testing?

Sounds risky!


I'm curious what people are doing. We're still very much in the experimentation phase.

> Sounds risky!

One of my first attempts at building file system tools for my custom agent called `tree` and caught a few node_modules directories. Blew up my context and cost me $5 in 60s. Fortunately I triggered the TPM rate limit and the thing stopped.
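For anyone building a similar tool, a minimal sketch of a directory listing that avoids this failure mode: it prunes noisy directories and caps the number of lines returned to the model. The name `bounded_tree`, the ignore list, and the 200-entry cap are all hypothetical choices, not anything from the agent described above.

```python
import os

# Hypothetical defaults; tune them for your own agent and token budget.
IGNORED_DIRS = {"node_modules", ".git", "__pycache__"}
MAX_ENTRIES = 200


def bounded_tree(root, ignored=IGNORED_DIRS, max_entries=MAX_ENTRIES):
    """Return an indented listing of root, skipping ignored dirs and capping size."""
    lines = []
    truncated = False
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into ignored directories.
        dirnames[:] = sorted(d for d in dirnames if d not in ignored)
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        if depth > 0:
            lines.append("  " * (depth - 1) + os.path.basename(dirpath) + "/")
        for name in sorted(filenames):
            lines.append("  " * depth + name)
        if len(lines) > max_entries:
            truncated = True
            break
    if truncated:
        # Keep the output bounded and tell the model that more exists.
        lines = lines[:max_entries]
        lines.append("... (truncated)")
    return "\n".join(lines)
```

The cap is checked once per directory, so memory use can briefly exceed the limit for one very large directory, but the string handed back to the model never does.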


Neat! Really like this, and will be interesting to see how it tracks over time.

My only concern is the scale you have selected: we are already close to 100, so are you going to have to allow it to be ‘turned up to 11’ at some point?


As an aside, the Ghostty project recently made it mandatory to disclose the use of AI coding tools:

https://github.com/ghostty-org/ghostty/pull/8289


Thanks Simon - you asked us to share patterns that work. Coincidentally I just finished writing up this post:

https://blog.scottlogic.com/2025/10/06/delegating-grunt-work...

Using AI Agents to implement UI automation tests - a task that I have always found time-consuming and generally frustrating!


I agree with the overall sentiment here, having written something similar recently:

“LLMs don’t know what they don’t know” https://blog.scottlogic.com/2025/03/06/llms-dont-know-what-t...

But I wouldn’t say it is the only problem with this technology! Rather, it is a subtle issue that most users don’t understand.


I had a car from the same era, the Boomerang, similar in style but 4WD:

https://randomcompetitions.co.uk/wp-content/uploads/2023/12/...

Absolutely loved that car, used it for hours and hours every week. The best 'toy' I ever had (other than my Amiga A500!).


I have to admit—even though I loved the Grasshopper—I did lust after the Boomerang!


Very cool. I actually created a shareware application that rendered Photomosaics 25 years ago! Here is a link to "Mosaic Magic" from the Wayback Machine:

https://web.archive.org/web/20010405175706/http://fishsoft.c...

I managed to make a decent amount of spending money from that application during my University years.

One interesting lesson I learnt was about enterprise pricing. I think I charged ~$20 for Mosaic Magic; however, I started to get emails from organisations asking about pricing for commercial use. Nothing on my pricing page suggested that commercial use was prohibited; I guess they just thought $20 was rather cheap. From then on, I charged $150 for "commercial use". Basically, anyone who thought they should be paying more, did!

Finally, Robert Silvers filed a patent and trademark for Photomosaics in ~2000, and was, for a while, aggressively pursuing organisations that he felt were infringing. I assume this has all died down now.


Thanks for sharing! Both the link and the story. As a University student and someone who has also made a photomosaic app, that is super relatable.

That's good to know about the pricing. It's free for now. Honestly didn't expect this level of interest (there are 307 people on the site right now according to analytics).

With that interest, I may add some plans, potentially for commercial use in particular.


I used Mosaic Magic back in my high school journalism class to design a yearbook cover 20+ years ago! Turned out great - it won some kind of award iirc.


That is so funny - after all these years you were sitting at your computer at the same time and clicked on the same post as the creator.


Oh wow, so cool to meet someone who used an app I wrote 25 years ago. Great to hear from you and congrats on the prize!


Kudos to OP for not immediately thinking about how he could best monetize this.

I'm glad FOSS made Shareware die out, but now web apps requiring me to create an account first have taken its place. Kudos to OP for resisting that too.


I just read up on Robert Silvers based on this. He had an interesting career. Really unfortunate he was a patent troll...


That is very cool, thanks for sharing, and congratulations on an epic streak.

I'm also into running visualisation, and created the running report card:

https://run-report.com/

It visualises your year in running, with some fun narrative generated by GPT. Here's my report card:

https://run-report.com/8725202.html


Nice work. My first thought was whether I could get similar graphs for my own data — this run-report.com does exactly that. Clean design, and the GPT-generated summary is a nice touch. Thanks for sharing.

