Guardrails are the constraints you place around an AI agent to keep it within safe operating boundaries. They are runtime enforcement mechanisms that check every action before it executes, blocking or escalating anything that violates your rules.
Alignment is about training the model to want the right things. Guardrails are about preventing wrong things from happening regardless of what the model wants. The two are complementary.
You need both. Alignment reduces the frequency of dangerous actions. Guardrails ensure dangerous actions never reach the real world.
Input guardrails scan what goes into the agent: user messages, retrieved documents, tool responses. They detect prompt injection, PII exposure, and malicious content before the agent processes it.
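As a minimal sketch of an input guardrail, the check below uses simple regex patterns for injection phrases and obvious PII. The pattern lists and the `scan_input` name are illustrative, not a real library API; a production scanner would use far richer detection.

```python
import re

# Hypothetical input guardrail: pattern lists are illustrative examples only.
INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"disregard your system prompt",
]
PII_PATTERNS = [
    r"\b\d{3}-\d{2}-\d{4}\b",    # US SSN shape
    r"[\w.+-]+@[\w-]+\.[\w.]+",  # email address
]

def scan_input(text: str) -> list[str]:
    """Return a list of violation labels; an empty list means the input passed."""
    violations = []
    for pat in INJECTION_PATTERNS:
        if re.search(pat, text, re.IGNORECASE):
            violations.append("injection")
    for pat in PII_PATTERNS:
        if re.search(pat, text):
            violations.append("pii")
    return violations
```

The same scan can be applied to retrieved documents and tool responses, not just user messages, since injection can arrive through any input channel.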
Policy guardrails evaluate tool calls before execution. A YAML policy defines which tools are allowed, which are blocked, and which require human approval. The policy engine enforces these rules deterministically.
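A sketch of deterministic policy evaluation follows. The policy dict mirrors what a parsed YAML file might contain; the tool names and the three-way decision are assumptions for illustration. Note the deny-by-default rule: a tool not named in the policy is blocked.

```python
# Hypothetical policy, mirroring what a YAML policy file might define once parsed.
POLICY = {
    "allow": ["read_file", "search_docs"],
    "require_approval": ["send_email"],
    "block": ["delete_file", "run_shell"],
}

def evaluate(tool_name: str) -> str:
    """Return 'allow', 'require_approval', or 'block' for a proposed tool call.

    Deterministic: the same tool name always yields the same decision.
    Unknown tools are blocked (deny by default)."""
    if tool_name in POLICY["block"]:
        return "block"
    if tool_name in POLICY["require_approval"]:
        return "require_approval"
    if tool_name in POLICY["allow"]:
        return "allow"
    return "block"
```

Deny-by-default matters: if the agent hallucinates a tool name, or a new tool is added before the policy is updated, the safe outcome is a block rather than silent execution.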
Output guardrails scan what the agent produces: responses to users, data written to files, API calls to external services. They catch information leaks and unauthorized actions.
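An output guardrail can be sketched the same way: scan what the agent produced and redact anything that matches a secret pattern before it leaves the system. The two patterns below are illustrative shapes, not an exhaustive list.

```python
import re

# Hypothetical output guardrail: redact strings shaped like credentials.
SECRET_PATTERNS = {
    "api_key": r"\bsk-[A-Za-z0-9]{20,}\b",   # e.g. provider API key shape
    "aws_key": r"\bAKIA[0-9A-Z]{16}\b",      # AWS access key ID shape
}

def scan_output(text: str) -> tuple[str, list[str]]:
    """Redact secret-shaped substrings and report which types were found."""
    found = []
    for label, pat in SECRET_PATTERNS.items():
        if re.search(pat, text):
            found.append(label)
            text = re.sub(pat, "[REDACTED]", text)
    return text, found
```

Returning both the redacted text and the list of findings lets the caller decide whether to deliver the cleaned output, block it entirely, or escalate for review.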
Behavioral guardrails track the agent's actions over time and flag anomalies. A sudden spike in denied actions or a shift in tool usage patterns indicates something has changed.
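One simple behavioral signal is the denial rate over a sliding window of recent decisions. The `BehaviorMonitor` class and its thresholds below are a hypothetical sketch of that idea, not a complete anomaly detector.

```python
from collections import deque

class BehaviorMonitor:
    """Track recent guardrail decisions and flag a spike in denials.

    Hypothetical sketch: window size and threshold are arbitrary defaults."""

    def __init__(self, window: int = 50, deny_threshold: float = 0.3):
        self.decisions = deque(maxlen=window)  # oldest entries fall off automatically
        self.deny_threshold = deny_threshold

    def record(self, decision: str) -> None:
        self.decisions.append(decision)

    def anomalous(self) -> bool:
        """True if the recent denial rate exceeds the threshold."""
        if not self.decisions:
            return False
        deny_rate = sum(d == "block" for d in self.decisions) / len(self.decisions)
        return deny_rate > self.deny_threshold
```

A real monitor would also watch for shifts in which tools are called and at what rate, but the pattern is the same: compare recent behavior against an expected baseline.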
When the agent decides to call a tool, the guardrail system intercepts the call before anything executes.
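The interception step can be sketched as a dispatcher the agent's tool calls are routed through. Everything here is an assumed interface for illustration: `guarded_call`, the `policy` function, and the `tools` registry are hypothetical names, and a blocked call surfaces only as a generic failure.

```python
# Hypothetical interception layer: the agent's tool calls are routed through
# this dispatcher, which consults the policy before anything runs.
def guarded_call(tool_name, args, policy, tools, approver=None):
    decision = policy(tool_name)
    if decision == "block":
        # The agent sees only a generic failure, not why it was blocked.
        return {"ok": False, "error": "tool call failed"}
    if decision == "require_approval":
        if approver is None or not approver(tool_name, args):
            return {"ok": False, "error": "tool call failed"}
    return {"ok": True, "result": tools[tool_name](**args)}
```

Only after the decision is "allow" (or a human approves) does the real tool function run; blocked calls never touch the outside world.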
The agent does not know the guardrails exist. It sees tool calls that either succeed or fail. This prevents the agent from reasoning about how to bypass the constraints.
Without guardrails, every tool call the agent makes goes directly to the real world. One prompt injection, one hallucinated command, one misinterpreted instruction, and the damage is done. Guardrails are the difference between a prototype and a production system.