At the end of this article it states, "Our tests gave models the vulnerable function directly, often with contextual hints (e.g., "consider wraparound behavior"). A real autonomous discovery pipeline starts from a full codebase with no hints." I'm not a cybersecurity expert, but isn't 80% of the challenge finding where the exploit lives in the code!?
That really undermines the author's claims. This article feels dishonest in its claim that "small, cheap, open-weights models ... recovered much of the same analysis."
I think I see the problem… and a possible solution.
I think the main problem with attempting to document this is that the system wouldn't actually be running off of it. An infrastructure document is read automatically and drives the deploy (or whatever). But if you want to change a human's responsibilities, you don't get the simplicity of updating your org documentation and clicking "execute." So the documentation you propose would always lag behind reality rather than being the actual driver of organizational behavior.
But! What if it was? What if all the managers in an organization were AI systems? They would read the diff in the org chart and initiate the communication with the respective human employees.
I could imagine testing this in a coffee-shop level business right now in which the LLM is probably capable of all the strategy and management decisions needed to effectively run it, operating within the constraints of policies and procedures all cleanly laid out in documentation.
For something like tests, where I have very specific opinions on how I want them written, I have a simple doc (tests.md) and I’ll regularly tag Claude with it.
Claude writes a bunch of new code and I’ll tell it, “Before I review this code, make sure all tests adhere to the guidance of @tests.md” (you can probably make this a slash command too)
I find that if I put these instructions in the system prompt, far down in a conversation that’s used lots of the context window, they will only loosely be followed. But when I tag it in like this, Claude will strongly and thoughtfully follow the guidance and examples I’ve written up about how I want my tests.
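The tagging approach above can also be wired up as a custom slash command. Claude Code picks up markdown files under `.claude/commands/` as slash commands; the filename and wording below are just a sketch, not the commenter's actual setup:

```markdown
<!-- .claude/commands/check-tests.md  (hypothetical filename) -->
Before I review this code, re-read @tests.md and make sure every new or
modified test adheres to its guidance. List any tests that deviate, and
fix them before reporting back.
```

Typing `/check-tests` then re-injects the instructions late in the conversation, the same trick as tagging the file manually.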
antirez — how do you reliably get Claude to re-read the file after compaction? It's easy to let Claude run for a while; it compacts and starts getting much worse afterwards, and I don't always catch the moment of compaction in time to tell it to re-read the notes file.
Run your own runner for GitHub Actions. GitHub provides a small runner application you install so your machine gets registered with your repo. Then you can SSH in whenever a job fails. This lets you fully inspect the state and execute one-off commands to test theories. It's a much faster way to iterate.
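For reference, registering a self-hosted runner looks roughly like this (the download URL and version are placeholders; GitHub shows the exact commands, URL, and a registration token under the repo's Settings → Actions → Runners page):

```shell
# On your own server: download and unpack the runner application
# (copy the exact URL/version from the repo's runner settings page)
mkdir actions-runner && cd actions-runner
curl -o actions-runner.tar.gz -L \
  https://github.com/actions/runner/releases/download/vX.Y.Z/actions-runner-linux-x64-X.Y.Z.tar.gz
tar xzf actions-runner.tar.gz

# Register against your repo; the token comes from the settings page
./config.sh --url https://github.com/OWNER/REPO --token <REGISTRATION_TOKEN>

# Start listening for jobs; when one fails, SSH in and poke around
./run.sh
```

Workflows then target the machine with `runs-on: self-hosted` instead of `runs-on: ubuntu-latest`.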
I don’t lock down my kids computer use. I know risks exist, but I think their unrestricted access to computers and the internet is far more beneficial than harmful.
I know I’m highly unusual amongst my friends. I’ve also found it odd that the more knowledgeable someone is about tech, the more scared they are of their kids using the internet.
Riding a bike and swimming in a pool are genuinely dangerous, yet I encourage my kids to do both and simply educate them about the risks. I think the benefit-to-risk ratio of the internet is far better than a bike or a pool, so I take the same approach: I just educate them.
I've had a lot of luck with this approach as well. So much so that I struggle to relate at all with the paranoid posts I see constantly on HN. My children aren't tiktok zombies despite them being allowed to use tiktok as long as they want during the day. It turns out they have lots of other things they would rather be doing including reading and playing instruments and hanging out with their friends online and in person.
(1) and (2) are good points. Particularly (2), because movies may intentionally add steps or slow things down so that a viewer can follow along, and this would be at odds with daily use.
However, I still think there's something to be said for movies attempting to build UIs that have a strong aesthetic and elicit an emotional response, whereas production apps feel so flat and boring, in comparison.
I still wonder why we aren't seeing people try to push the envelope stylistically to "wow" users.
My assumption is a bit different: the UIs in science fiction are intended to communicate information; typically that's to advance the storyline, but there's a lot of overlap with real apps. Maybe more importantly, UIs in movies are meant to elicit a feeling. Maybe it's a shallow feeling of "this is cool!", but to me that's where production apps mostly give up. "It works, let's move on..." seems to be the bar in most cases, rather than "let's wow users!"
This might be a clearer articulation of what I'm trying to get at with my question...
Over the years, I've seen a lot of attempts to make UIs "wow users," and each time the result has been very problematic. Movie UIs are meant, as you say, to communicate the things the movie needs communicated. Those things are unrelated to what people actually using the software need communicated.