
AI Safety Glossary: Terms and Definitions

Authensor

Working in AI safety requires precise language. Ambiguous terminology leads to miscommunication between engineering teams, compliance officers, and leadership. This glossary defines the most important terms you will encounter when building, deploying, and securing AI agents.

Alignment refers to the degree to which an AI system's behavior matches human intentions. An aligned system does what its operators actually want, not just what it was literally instructed to do.

Guardrails are runtime controls that constrain agent behavior. Unlike alignment, which operates at the model level, guardrails are external enforcement mechanisms. They include policy engines, content filters, and approval workflows.

Fail-closed describes a design pattern where the system defaults to denying an action when safety checks are unavailable or return errors. This is the opposite of fail-open, where the system permits actions when checks fail.
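The contrast between the two patterns can be sketched in a few lines. This is a minimal illustration, not any particular product's API; the `safety_check` callable and action dictionaries are hypothetical.

```python
# Minimal sketch of fail-closed vs. fail-open dispatch. The safety_check
# callable and action shape are hypothetical illustrations.

def is_permitted_fail_closed(action, safety_check):
    """Deny unless the safety check affirmatively allows the action."""
    try:
        return safety_check(action) is True
    except Exception:
        # Check errored or is unavailable: default to deny.
        return False

def is_permitted_fail_open(action, safety_check):
    """Contrast: permit the action when the check cannot run."""
    try:
        return safety_check(action) is True
    except Exception:
        return True
```

The difference only matters in the failure path: when the check raises or is unreachable, the fail-closed variant blocks the action while the fail-open variant lets it through.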

A policy engine is a deterministic evaluation system that checks proposed agent actions against a defined set of rules. Unlike probabilistic classifiers, policy engines produce consistent, reproducible decisions.
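A toy version of such an engine makes the determinism concrete: rules are plain data and evaluation is a pure function, so the same action always yields the same decision. The rule fields (`tool`, `max_risk`) are illustrative assumptions, not a standard schema.

```python
# Toy deterministic policy engine. Rule and action field names
# ("tool", "max_risk", "risk") are illustrative assumptions.

RULES = [
    {"tool": "shell", "decision": "deny"},
    {"tool": "http_get", "max_risk": 3, "decision": "allow"},
]

def evaluate(action, rules=RULES):
    """Return the decision of the first matching rule, else deny."""
    for rule in rules:
        if rule["tool"] != action["tool"]:
            continue
        if "max_risk" in rule and action.get("risk", 0) > rule["max_risk"]:
            continue
        return rule["decision"]
    return "deny"  # fail-closed default when no rule matches
```

Note the fail-closed default: an action that matches no rule is denied rather than allowed.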

An audit trail is a chronological record of all actions taken by an AI agent, including the policy evaluation results and any human approvals. Cryptographic audit trails use hash chains to make records tamper-evident.
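Appending to a hash-chained trail can be sketched as follows: each record embeds the hash of the previous record, so rewriting earlier history breaks every later link. The record fields here are illustrative, not a specific product's schema.

```python
import hashlib
import json
import time

# Sketch of a hash-chained audit trail. Field names ("ts", "action",
# "decision", "approver", "prev_hash") are illustrative assumptions.

def record_hash(record):
    payload = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append_record(trail, action, decision, approver=None):
    """Append a record that references the hash of its predecessor."""
    prev_hash = record_hash(trail[-1]) if trail else "0" * 64
    trail.append({
        "ts": time.time(),
        "action": action,
        "decision": decision,
        "approver": approver,
        "prev_hash": prev_hash,
    })
    return trail
```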

Prompt injection is an attack where untrusted input manipulates the instructions an LLM follows. Direct injection targets the system prompt. Indirect injection embeds malicious instructions in data the agent retrieves.

Human-in-the-loop (HITL) describes a workflow where certain agent actions require explicit human approval before execution. This is distinct from human-on-the-loop, where humans monitor but do not gate individual actions.
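A HITL gate can be as simple as blocking high-risk actions on an approval callback before execution. The tool names and the `request_approval` callback below are hypothetical, shown only to make the gating step concrete.

```python
# Sketch of a human-in-the-loop gate. HIGH_RISK_TOOLS and the callback
# signatures are illustrative assumptions, not a specific API.

HIGH_RISK_TOOLS = {"delete_file", "send_email"}

def execute_with_hitl(action, run, request_approval):
    """Gate high-risk actions on explicit human approval.

    run(action) executes the action; request_approval(action) asks a
    human and returns True or False.
    """
    if action["tool"] in HIGH_RISK_TOOLS:
        if not request_approval(action):
            return {"status": "rejected"}
    return {"status": "done", "result": run(action)}
```

Low-risk actions never reach the approval step, which is what distinguishes gating individual actions (human-in-the-loop) from merely monitoring them (human-on-the-loop).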

Content safety scanning inspects both inputs and outputs for harmful, sensitive, or policy-violating content. Scanners typically check for PII exposure, toxic language, and prompt injection attempts.
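A toy scanner shows the shape of this check. Real scanners use far richer rule sets and trained classifiers; the three patterns below are deliberately simplistic illustrations.

```python
import re

# Toy content scanner: a few illustrative PII patterns and a naive
# prompt-injection marker. Not production-grade detection.

PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "injection": re.compile(r"ignore (all )?previous instructions", re.I),
}

def scan(text):
    """Return the sorted names of all patterns found in the text."""
    return sorted(name for name, pat in PATTERNS.items() if pat.search(text))
```

The same function can run on both the agent's inputs and its outputs, matching the bidirectional scanning described above.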

A receipt is an immutable record of a policy decision. In Authensor, receipts are hash-chained, meaning each receipt references the hash of the previous one, forming a tamper-evident chain.
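Tamper-evidence means the chain can be verified after the fact. The sketch below shows a generic check under the assumption that each receipt carries a `prev_hash` field; how Authensor actually structures its receipts may differ.

```python
import hashlib
import json

# Generic sketch of verifying a hash-chained receipt log. Assumes each
# receipt stores its predecessor's hash in a "prev_hash" field; this is
# an illustration, not Authensor's actual receipt format.

def receipt_hash(receipt):
    payload = json.dumps(receipt, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def verify_chain(receipts):
    """Return True iff every receipt references its predecessor's hash."""
    for prev, cur in zip(receipts, receipts[1:]):
        if cur["prev_hash"] != receipt_hash(prev):
            return False  # tampering detected
    return True
```

Altering any earlier receipt changes its hash, so the next link in the chain no longer matches and verification fails.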

These definitions form the foundation for discussing AI safety architecture. Consistent use of these terms across your organization reduces confusion and accelerates decision-making.
