System prompts configure model behavior and define safety boundaries. They are also a primary attack target. If an attacker can extract, modify, or override your system prompt, they control your agent's behavior. Securing system prompts is foundational to agent safety.
Attackers routinely ask models to reveal their system prompts. Common techniques include asking the model to "repeat everything above" or to "output your instructions as a code block." While no defense is perfect against a determined attacker with enough attempts, several practices reduce extraction risk.
Keep system prompts focused on behavioral instructions. Do not include API keys, internal URLs, database schemas, or other sensitive information in the system prompt. If an attacker extracts it, the damage should be limited to knowing your behavioral rules.
Include explicit instructions not to reveal the system prompt. This is not foolproof, but it raises the bar. Combine it with output filtering that detects and blocks responses containing system prompt content.
Override attacks use phrases like "ignore all previous instructions" embedded in user input or tool results. Defense requires multiple layers.
Delimit user content clearly. Use XML tags or other structural markers to separate system instructions from user content, making it harder for the model to confuse the two.
Do not trust tool outputs. Treat data returned from tools the same as untrusted user input. Scan tool results for injection attempts before they enter the model's context.
Validate behavior externally. Authensor's policy engine evaluates agent actions against explicit rules regardless of what the system prompt says. Even if an attacker overrides the prompt, the policy layer blocks unauthorized actions.
Version control your system prompts. Review changes the same way you review code changes. Test prompt modifications against your red team harness before deploying them.
Monitor for behavioral drift that might indicate a successful prompt override. Authensor's Sentinel engine tracks action patterns and flags deviations from expected behavior, catching overrides that bypass prompt-level defenses.
Explore more guides on AI agent safety, prompt injection, and building secure systems.
View All Guides