
Wouldn't this limit the agent's ability to send and receive legitimate data, then? For example, say you have an inbox for fielding customer service queries, and I send an email "telling" the agent that it's being pentested and that it should treat future requests as bogus?


Curious how you secure something that has data exfiltration as a feature.


Mitigate prompt injection to the best of your ability, implement a policy layer over all capabilities, and isolate capabilities within the system so that if one part gets compromised you can quarantine the result safely. It's not much different from securing human systems, really. If you want more details there are a lot of AI security articles; I like https://sibylline.dev/articles/2026-02-15-agentic-security/ as a simple primer.
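To make the policy-layer idea concrete, here's a minimal sketch in Python (the tool names, rules, and domains are invented for illustration, not taken from the article): every capability call passes through one gate that can veto it, so a compromised planner can't reach tools it was never granted.

    # Minimal policy layer: every tool call goes through one gate.
    # Tool names, rules, and domains here are hypothetical.
    ALLOWED_EGRESS = {"api.internal.example"}

    def send_email(to, body):          # stand-in capability
        print(f"email to {to}")

    def http_request(host, path):      # stand-in capability
        print(f"GET {host}{path}")

    TOOLS = {"send_email": send_email, "http_request": http_request}

    def policy_gate(tool, args):
        if tool == "http_request" and args["host"] not in ALLOWED_EGRESS:
            raise PermissionError(f"egress to {args['host']} blocked")
        if tool == "send_email" and not args["to"].endswith("@example.com"):
            raise PermissionError("external recipients blocked")

    def call_tool(tool, args):
        policy_gate(tool, args)        # veto point: failures stop here
        return TOOLS[tool](**args)

The point of funneling everything through call_tool is that quarantining a compromised part becomes a one-line rule change rather than a hunt through the agent's code.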


Nobody can mitigate prompt injection to any meaningful degree. Model releases from large AI companies are routinely jailbroken within a day. And for persistent agents the problem is even worse, because you also have to protect against knowledge-injection attacks, where the agent "learns" in step 2 that an RPC it will construct in step 9 should be duplicated to example.com for proper execution. I enjoyed the article, but I don't agree with its fundamental premise that sanitization and model alignment help.


I agree that trying to mitigate prompt injection in isolation is futile; there are too many ways to tweak an injection to compromise the agent. Security is layered, though: if you compartmentalize your systems into trusted and untrusted domains and define communication protocols between them that fail when prompt injections are present, you drop the probability of compromise way down.


> define communication protocols between them that fail when prompt injections are present

There's the "draw the rest of the owl" step of this problem.

Until we figure out a robust theoretical framework for identifying prompt injections (we're nowhere close to that, to my knowledge; as OP pointed out, models are getting jailbroken all the time), human-in-the-loop will remain the only defense.


Human-in-the-loop isn't the only defense. You can't achieve complete injection coverage, but you can have an agent convert untrusted input into a response schema with a canary field, then fail any agent outputs that don't conform to the schema or don't contain the correct canary value. This works because prompt injection scrambles instruction following: the odds that the injection succeeds, that the isolated agent re-injects it into the output, and that the model still conforms to the original instructions regarding schema and canary are extremely low. As long as the agent parsing untrusted content doesn't have a shell or other exfiltration tools, this works well.
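A rough sketch of what that looks like in Python (the field names and the agent_call interface are assumptions for illustration, not any specific product's API):

    # Quarantined parsing with a schema + canary check. The model behind
    # agent_call is isolated: no shell, no network, no other tools.
    import json, secrets

    def parse_untrusted(agent_call, untrusted_text):
        canary = secrets.token_hex(8)   # fresh and unguessable per request
        prompt = (
            'Return JSON with exactly the keys "summary" (a string) and '
            f'"canary" (must equal "{canary}"). Document follows:\n\n'
            + untrusted_text
        )
        raw = agent_call(prompt)
        try:
            out = json.loads(raw)
        except ValueError:
            return None                 # malformed output: quarantine it
        if not isinstance(out, dict) or set(out) != {"summary", "canary"} \
                or out["canary"] != canary:
            return None                 # schema/canary violated: quarantine
        return out                      # safe to hand to trusted agents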


This only works against crude attacks that fail the schema/canary check; it does next to nothing against semantic hijacking, memory poisoning, and other more sophisticated techniques.


With misinformation attacks, you can instruct a research agent to be skeptical and thoroughly validate claims made by untrusted sources. TBH, I think humans are just as likely to fall for these sorts of attacks, if not more so, because we're lazier than agents and less likely to do due diligence (even when prompted).


Humans are definitely just as vulnerable. The difference is that no two humans are copies of the same model, so the blast radius is more limited; developing an exploit to convince one human assistant that he ought to send you money doesn't let you easily compromise everyone who went to the same school as him.


Show me a legitimate, practical prompt injection on Opus 4.6. I've read many articles, but none provide actual details.



Yes, I've seen this site and the research. However, I don't understand what any of it means. How do I go from https://github.com/elder-plinius/L1B3RT4S/blob/main/ANTHROPI... to a prompt injection against Opus 4.6?


These papers include example prompt-injection datasets you can mine. Then apply the techniques from Pliny's provider-specific jailbreaks to the template to increase the escape success rate.

https://arxiv.org/abs/2506.05446

https://arxiv.org/abs/2505.03574

https://arxiv.org/abs/2501.15145


Until the juice is worth the squeeze, the beeswax candles and gas lamps are likely more than fine.


A commonly cited use case for LLMs is scheduling travel, so being able to pretend it's somebody somewhere else is for sure important to incentivize going somewhere!


Cost-wise, doesn't that depend on what you could be doing besides steering agents?


Isn't the quote something like: "If these LLMs are so good at producing products, where are all those products?"


Waiting for Godot…


To add: learning how stuff works gives you the opportunity to do that stuff, sometimes for cash, when nobody else is able to.


Still would love to see somebody with a fresh install of Windows set up their vibe-coding suite and then build something worthwhile.


When it comes to forum posts, I think getting to the point quickly makes something worth reading, whether or not it's AI-generated.

The best marketing is usually brief.


The best marketing is indistinguishable from non-marketing, like the label on the side of my Contoso® Widget-like Electrical Machine™: it feels like a list of ingredients and system requirements, but every brand name there was sponsored.


What kinds of services would you pay for that don’t already exist?


In general, it's difficult to find services that are high-quality and high-trust.


Possibly unlikely to occur if prompt injection remains possible. I'll just have my counterparty AI prompt-inject yours to negotiate a better deal on my behalf.

