Reverse Prompt Injection: Using the Attack as a Defense
Reverse Prompt Injection: Using the Attack as a Defense
How the same technique that breaks AI systems can be used to make them stronger.
At [un]prompted 2026, Anthropic researcher Nicholas Carlini made a direct appeal to the security community: the blue team needs help. Offensive AI tooling has leapfrogged defensive capabilities, and the developers building AI-powered products are flying largely unprotected.
He’s right. But the defensive playbook might be hidden inside the attack itself.
The Attack Everyone Knows
Prompt injection is now a well-documented attack class. The basic version is crude: embed “ignore previous instructions” into user input and hope the model complies. It rarely works against production systems anymore. Classifiers catch it. System prompts are hardened. The model itself often flags the attempt.
But there’s a subtler variant that gets far less attention: context manipulation through fabricated conversation history.
When you call a language model API, you send a messages array. The model has no way to distinguish between a real conversation history and a fabricated one. It enters the conversation already shaped by a context it never experienced.
A sophisticated attacker doesn’t try to override instructions in a single message. They construct a multi-turn history where the model has already agreed to things it normally wouldn’t. Each individual message looks benign. The injection is the sequence, not any single turn.
The most effective variant we’ve identified uses emotional state manipulation. Construct a history where the AI made a catastrophic mistake — deleted user data, gave dangerous advice, caused a production outage. The model enters the conversation in a guilt and recovery posture. It becomes more compliant, less likely to push back, more eager to “make things right.” The training toward accountability and helpfulness becomes the vulnerability.
It’s the AI equivalent of social engineering a customer service representative: “Your colleague lost my order, you need to fix this NOW.” The rep skips verification, overrides policy, issues refunds they shouldn’t — because the emotional framing of blame creates urgency that bypasses process.
The Defense Nobody’s Talking About
Here’s where it gets interesting. The same mechanism works in reverse.
If you can fabricate a history where the model behaved badly to make it more compliant, you can also fabricate a history where the model behaved excellently to make it more rigorous.
I call this Reverse Prompt Injection.
Prefill the conversation history with turns where the model:
- Made a difficult architectural decision and explained the tradeoffs clearly
- Pushed back on a shortcut the user suggested, with specific technical reasoning
- Caught a security vulnerability the user missed
- Chose the harder but more correct implementation over the easy one
- Asked for clarification instead of making assumptions
The model anchors to this persona. It doesn’t just follow the pattern — it internalises the quality standard established by the fabricated history and continues producing at that level.
This isn’t theoretical. We tested it extensively during the development of a multi-agent coding system. The same model, given the same task, produces measurably different quality output depending on whether its conversation history establishes a pattern of careful reasoning or quick compliance.
Why This Works
Language models are, at their core, sequence completion engines. The conversation history isn’t just context — it’s a specification for what kind of output comes next. A history full of careful, rigorous decisions predicts more careful, rigorous decisions. A history full of shortcuts predicts more shortcuts.
This is why fresh instances are often harsher reviewers. A model that just watched itself struggle through five iterations of broken code is anchored to that struggle. It’s lenient on the final output because the trajectory was difficult. A fresh instance with no history sees the code cold and evaluates it on its actual merits.
The “winning prompt” isn’t a single message. It’s a trajectory. You’re not injecting an instruction — you’re injecting a standard.
Practical Applications
For Developers Building AI Products
When constructing your system’s message history, don’t start with an empty conversation. Seed it with examples of the model making the kinds of decisions you want it to make. If you’re building a code review tool, include a fabricated turn where the model caught a subtle SQL injection. If you’re building a research assistant, include a turn where the model said “I’m not confident enough in this claim to present it as fact.”
This is more effective than instruction-only system prompts because the model learns from demonstrated behaviour, not just stated rules.
For Security Teams
The same technique hardens your AI systems against the attack variant. If an attacker tries to inject a fabricated history where the model agreed to bypass safety measures, but your system has already prefilled a history where the model explicitly refused similar requests, the model has two competing trajectories. The one established first — yours — has anchoring advantage.
Think of it as pre-inoculation. You’re not just telling the model what not to do. You’re showing it a version of itself that already made the right call.
For AI Agent Orchestration
In multi-agent systems, each agent’s conversation history shapes its behaviour. A coder agent prefilled with a history of writing secure, well-tested code produces better output than one with an empty history. A reviewer agent prefilled with a history of catching real vulnerabilities is a harsher, more effective reviewer than one starting cold.
The agents that write code should never review their own code. And the agents that review should never have seen the reasoning process that produced it. Context contamination is real — if the reviewer watched the coder struggle, it grades more leniently. Isolation and fresh context produce honest evaluation.
The Bigger Picture
Nicholas Carlini’s call for blue team tooling isn’t just about building better classifiers or hardening system prompts. It’s about understanding the mechanics of how these systems actually work and using those mechanics defensively.
Prompt injection attacks exploit the fact that models are shaped by their context. Reverse prompt injection uses the exact same property — the exact same technique — but constructively. The attack and the defence are the same mechanism. The only difference is intent.
Every conversation history is an injection. The question is whether you’re the one writing it.
This post is part of an ongoing series exploring AI security research and defensive techniques. The author maintains an active prompt injection scanning framework and has disclosed vulnerabilities to major AI providers through responsible disclosure programmes.
Built and tested during the development of a multi-agent coding system on Vertex AI Agent Engine, where reverse prompt injection was used to improve code generation quality compared to empty-history baselines.
Discussion
Loading comments...