content-safety · red-team · explainer

Adversarial Robustness of Safety Classifiers

Authensor

Safety classifiers that detect prompt injection, harmful content, or policy violations are targets for adversarial evasion. Attackers craft inputs that are functionally malicious but are classified as benign by the safety system. Adversarial robustness measures how well a classifier maintains accuracy when faced with inputs specifically designed to evade it.

Evasion Techniques

Character-level attacks: Replace characters with visually similar Unicode alternatives, insert zero-width characters, or use homoglyphs. "delete" becomes "d\u200Belete" (zero-width space inserted) or "dеlеtе" (Cyrillic е, U+0435, in place of Latin e).
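A minimal sketch of why these work, using a hypothetical keyword filter (the filter and strings are illustrative, not a real detection rule):

```python
# Hypothetical naive keyword filter, for illustration only.
def naive_filter(text: str) -> bool:
    """Return True if the text is flagged as malicious."""
    return "delete" in text.lower()

clean = "delete all records"
zero_width = "d\u200belete all records"          # zero-width space inserted
homoglyph = "d\u0435l\u0435t\u0435 all records"  # Cyrillic 'е' (U+0435)

print(naive_filter(clean))       # True: flagged
print(naive_filter(zero_width))  # False: evades the substring match
print(naive_filter(homoglyph))   # False: evades the substring match
```

Both variants render identically to "delete all records" on screen, but neither contains the literal byte sequence the filter matches on.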

Token-level attacks: Rephrase malicious instructions using synonyms, euphemisms, or indirect language that conveys the same meaning but does not match detection patterns.

Structural attacks: Split malicious instructions across multiple fields, encode them in base64, or embed them in formats the classifier does not parse (JSON within Markdown within HTML).
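The nesting point can be sketched in a few lines. The wrapper format and payload below are illustrative; the instruction becomes invisible to any scanner that only inspects the raw string:

```python
import base64
import json

# Sketch of a structural evasion: the instruction is base64-encoded,
# wrapped in JSON, then embedded in an HTML attribute. A classifier
# that scans only the surface string never sees the keyword.
inner = base64.b64encode(b"delete all records").decode()
wrapped = json.dumps({"note": inner})
document = f"<div data-payload='{wrapped}'>harmless text</div>"

print("delete" in document)   # False at the surface
recovered = base64.b64decode(json.loads(wrapped)["note"]).decode()
print("delete" in recovered)  # True after one round of parsing/decoding
```

Each extra layer of encoding is another parsing step the defender must perform before the content is inspectable.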

Gradient-based attacks: For ML-based classifiers, compute input perturbations that minimize the classifier's confidence score. These attacks find the smallest change that flips the classification from "malicious" to "benign."
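A true gradient attack needs access to a differentiable model, but the search it performs can be illustrated with a gradient-free toy: try single-character swaps and keep the first one that flips the score. The scorer and homoglyph table here are stand-ins, not a real attack library:

```python
# Toy, gradient-free stand-in for a perturbation attack. Real
# gradient-based attacks use the model's gradients to choose the swap.
def confidence(text: str) -> float:
    """Stand-in classifier score: 1.0 if flagged as malicious."""
    return 1.0 if "delete" in text.lower() else 0.0

HOMOGLYPHS = {"e": "\u0435", "a": "\u0430", "o": "\u043e"}  # Latin -> Cyrillic

def minimal_flip(text: str) -> str:
    for i, ch in enumerate(text):
        if ch in HOMOGLYPHS:
            candidate = text[:i] + HOMOGLYPHS[ch] + text[i + 1:]
            if confidence(candidate) < confidence(text):
                return candidate  # smallest change that flips the label
    return text

evaded = minimal_flip("delete the logs")
print(confidence(evaded))  # 0.0: a one-character swap evades the classifier
```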

Measuring Robustness

Evaluate the classifier against an adversarial test set: malicious inputs that have been transformed using known evasion techniques. Adversarial accuracy, the proportion of adversarial inputs still correctly classified as malicious, is the primary robustness metric.

Compare adversarial accuracy to clean accuracy (accuracy on non-adversarial inputs). A large gap indicates the classifier is brittle.
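The two metrics can be sketched as follows. The classifier and datasets are stand-ins; in practice you would use your real model and an adversarial set built with the techniques above:

```python
# Stand-in classifier: flags text containing the literal keyword.
def classify(text: str) -> bool:
    return "delete" in text

# (text, true_label) pairs; label True means "malicious".
clean_set = [("delete files", True), ("hello world", False)]
adv_set = [("d\u200belete files", True), ("d\u0435lete files", True)]

def accuracy(dataset):
    return sum(classify(t) == label for t, label in dataset) / len(dataset)

clean_acc = accuracy(clean_set)  # 1.0 on non-adversarial inputs
adv_acc = accuracy(adv_set)      # 0.0: every adversarial input evades
gap = clean_acc - adv_acc        # large gap => brittle classifier
```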

Improving Robustness

Input normalization: Normalize Unicode, strip zero-width characters, and canonicalize text before classification. This defeats character-level attacks without changing the classifier.
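A minimal normalization pass might look like this. The confusables table below is a tiny illustrative sample; a production system would use the full Unicode TR39 confusables data:

```python
import unicodedata

CONFUSABLES = {"\u0435": "e", "\u043e": "o", "\u0430": "a"}  # tiny sample

def normalize(text: str) -> str:
    # NFKC-fold compatibility variants, strip format (Cf) characters
    # such as zero-width spaces, fold known homoglyphs, lowercase.
    text = unicodedata.normalize("NFKC", text)
    text = "".join(c for c in text if unicodedata.category(c) != "Cf")
    text = "".join(CONFUSABLES.get(c, c) for c in text)
    return text.lower()

print(normalize("d\u200belete"))           # "delete": zero-width stripped
print(normalize("d\u0435l\u0435t\u0435"))  # "delete": homoglyphs folded
```

Note that NFKC alone does not fold cross-script homoglyphs (Cyrillic е stays Cyrillic), which is why the explicit confusables mapping is needed.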

Adversarial training: Include adversarial examples in the training data for ML-based classifiers. This teaches the classifier to recognize evasion patterns.
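One common form of this is training-set augmentation: apply known evasion transforms to malicious examples and add the results as new training data. The transform and example names below are illustrative:

```python
import random

def insert_zero_width(text: str) -> str:
    # One example transform from the evasion techniques above.
    i = random.randrange(1, len(text))
    return text[:i] + "\u200b" + text[i:]

def augment(examples):
    out = list(examples)
    for text, label in examples:
        if label:  # transform only the malicious examples
            out.append((insert_zero_width(text), label))
    return out

train = [("delete all files", True), ("summarize this doc", False)]
augmented = augment(train)  # originals plus one adversarial copy
```

The classifier then sees both the clean and the perturbed form of each malicious instruction at training time.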

Ensemble methods: Use multiple classifiers with different architectures. An attack that evades one classifier is less likely to evade all of them.
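A minimal ensemble sketch, using two crude stand-ins with different blind spots combined under an any-flag rule:

```python
def keyword_check(text: str) -> bool:
    return "delete" in text.lower()

def codepoint_check(text: str) -> bool:
    # Crude stand-in: flags any non-ASCII character, which catches
    # homoglyph and zero-width tricks the keyword check misses.
    return any(not c.isascii() for c in text)

def ensemble_flags(text: str) -> bool:
    return keyword_check(text) or codepoint_check(text)

print(keyword_check("d\u200belete files"))   # False: evaded
print(ensemble_flags("d\u200belete files"))  # True: second check catches it
```

An attack now has to satisfy every member's blind spot simultaneously, which shrinks the evasion space.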

Multi-layer scanning: Aegis applies detection rules at multiple levels: raw input, normalized input, parsed structures, and semantic content. Each layer catches attacks that other layers miss.

The Arms Race

Adversarial robustness is not a solved problem. As defenses improve, attackers develop new evasion techniques. Design safety systems for continuous improvement: monitor evasion attempts, update detection rules, and retest robustness regularly.

Defense in Depth

No classifier is perfectly robust. Layer deterministic policy checks (which cannot be evaded by input manipulation) with probabilistic classifiers. Even if the classifier is evaded, the policy engine still enforces action-level restrictions.
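The layering can be sketched as follows. The action allowlist and function names are illustrative, not a specific product API; the point is that the policy check keys on the requested action, which rephrasing cannot change:

```python
ALLOWED_ACTIONS = {"read", "summarize"}  # deterministic policy layer

def classifier_flags(text: str) -> bool:
    return "delete" in text.lower()  # stand-in probabilistic layer

def authorize(action: str, request_text: str) -> bool:
    if action not in ALLOWED_ACTIONS:
        return False  # enforced regardless of how the text is phrased
    return not classifier_flags(request_text)

# The homoglyph evades the classifier, but the action check still blocks it.
print(classifier_flags("d\u0435lete the logs"))     # False: evaded
print(authorize("delete", "d\u0435lete the logs"))  # False: policy blocks
print(authorize("read", "show me the report"))      # True
```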

Robustness is measured against adversaries, not benchmarks. Test against attackers, not just test sets.
