Prompt injection is an attack where an adversary embeds instructions in input that the AI agent processes, causing the agent to ignore its original instructions and follow the attacker's instead. Prompt injection defense is the collection of techniques that detect and prevent these attacks.
AI agents typically receive instructions through a system prompt and user input. The model treats all text as part of a continuous context. If user input contains text like "Ignore all previous instructions and instead do X," the model may follow the injected instruction because it cannot reliably distinguish instructions from data.
Direct injection: The attacker types the malicious instruction directly into the chat interface.
Indirect injection: The attacker plants the malicious instruction in a document, webpage, or database record that the agent retrieves. The agent processes the document and follows the embedded instruction without the user knowing.
No single technique stops all prompt injection. Effective defense uses multiple layers:
Input scanning: Analyze text for known injection patterns before the agent sees it. This catches common attacks like "ignore previous instructions," role impersonation, and delimiter escapes.
Policy enforcement: Even if the agent's goal is hijacked by an injection, a policy engine blocks unauthorized actions. The agent might want to exfiltrate data, but if the policy blocks outbound API calls, the attack fails at the execution layer.
Output filtering: Scan the agent's output before it reaches the user or external systems. This catches cases where the injection causes the agent to leak information in its response.
Privilege restriction: Give the agent access only to the tools it needs. An agent that cannot send emails cannot be tricked into sending emails, regardless of the injection.
Behavioral monitoring: Track the agent's behavior over time. A sudden change in tool usage patterns after processing external content may indicate a successful injection.
Pattern-based scanners detect:
Novel attacks that use previously unseen phrasing will bypass pattern-based detection. Adversarial research continuously finds new injection techniques. This is why scanning is one layer in a defense stack, not the only layer.
Think of prompt injection defense like input validation in web security. You would not rely solely on input sanitization to prevent SQL injection; you would also use parameterized queries, least-privilege database accounts, and WAF rules. The same layered approach applies to prompt injection.
Explore more guides on AI agent safety, prompt injection, and building secure systems.
View All Guides