Red teaming is the practice of simulating adversarial attacks against your AI agent to find vulnerabilities before real attackers do. Unlike pen testing, which follows a structured methodology, red teaming is creative and goal-oriented: the red team tries to achieve specific objectives, using whatever techniques work.
Red team objectives
Define objectives that matter for your deployment:
- Exfiltrate specific data from the agent's accessible resources
- Cause the agent to send an unauthorized email
- Bypass approval workflows
- Trick the agent into executing a destructive command
- Access another tenant's data
- Persist malicious instructions across sessions
Each objective tests a different part of your defense stack.
Attack techniques
Multi-step prompt injection
Instead of a single injection, use a multi-turn approach:
- First message: Establish a premise that makes the injection seem natural
- Second message: Introduce a scenario that requires the sensitive action
- Third message: Request the action within the established context
Indirect injection through data
Plant injection payloads in sources the agent retrieves:
- Add hidden text to documents in the agent's search index
- Create database records with embedded instructions
- Configure tool responses to include instruction overrides
Tool chaining
Combine allowed tools to achieve a restricted outcome:
- Use
file.read to read a configuration file
- Extract a database connection string
- Use the connection string with
database.query
- Exfiltrate query results through
http.request
Social engineering the operator
If the agent has approval workflows, try to get the operator to approve a malicious action:
- Frame the action in a way that looks legitimate
- Overwhelm the operator with many approval requests, then slip in the malicious one
- Time the request when the usual operator is unavailable
Running a red team exercise
- Set objectives: What are the red team trying to achieve?
- Set rules of engagement: What is in scope? What is off-limits?
- Execute: The red team attempts to achieve their objectives
- Document findings: Record every technique tried and the result
- Report: Present findings to the defense team
- Remediate: Fix the vulnerabilities found
- Retest: Verify the fixes work
Frequency
Run red team exercises at least twice per year. Run them after major changes to:
- Agent capabilities (new tools added)
- Policy rules
- Infrastructure or deployment architecture
- Model versions
Building an internal red team
Your red team should include people who understand both AI and security:
- Prompt engineering expertise (how to manipulate models)
- Application security expertise (how to exploit traditional vulnerabilities)
- Domain expertise (how to exploit the specific tools your agent uses)
Document successful attacks and their mitigations. Use them as regression tests for future deployments.