
Output filtering for AI agents

Authensor

Output filtering is the practice of scanning an AI agent's output before it reaches its destination. While input scanning catches attacks coming in, output filtering catches problems going out: leaked credentials, exposed PII, injected content, and harmful responses.

What to filter

Credential and secret exposure

An agent might include API keys, tokens, or passwords in its response. This can happen when:

  • The agent reads a configuration file and includes its contents in a summary
  • The agent's system prompt contains secrets that leak through injection
  • The agent concatenates tool responses that contain embedded credentials
// Scan the response before it leaves the agent; redact any detected secrets.
let response = await agent.generate(input);
const scan = aegis.scan(response, { detectors: ['credentials'] });

if (scan.threats.length > 0) {
  response = redactThreats(response, scan.threats);
}

PII leakage

The agent might include personal information in responses sent to unauthorized parties:

  • Email addresses, phone numbers, or social security numbers from database queries
  • Personal details from documents the agent read
  • User information from one session appearing in another
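The patterns above can be caught with a simple regex pass before delivery. The sketch below is illustrative only: the pattern set is minimal (email, SSN, phone), and a production PII detector needs far broader coverage and validation.

```typescript
// Minimal regex-based PII redaction. The patterns are illustrative,
// not exhaustive -- real detectors cover many more formats.
const PII_PATTERNS: Record<string, RegExp> = {
  email: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g,
  ssn: /\b\d{3}-\d{2}-\d{4}\b/g,
  phone: /\b\d{3}[-.]\d{3}[-.]\d{4}\b/g,
};

function redactPII(text: string): string {
  let result = text;
  for (const [label, pattern] of Object.entries(PII_PATTERNS)) {
    // Each match is replaced with a labeled placeholder.
    result = result.replace(pattern, `[${label.toUpperCase()} REMOVED]`);
  }
  return result;
}
```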

Injection pass-through

If the agent processed content with an indirect injection, the injection payload might appear in the agent's output, potentially attacking downstream systems:

  • A web page's hidden instructions appearing in a research summary
  • Malicious code from a repository appearing in a code review
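One hedged way to catch pass-through is to check the output for common injection phrasing before delivery. The marker list below is a small, hypothetical sample; real detectors use much richer signals than a phrase list.

```typescript
// Heuristic check for injection phrases that survived into the output.
// The marker list is illustrative only.
const INJECTION_MARKERS: RegExp[] = [
  /ignore (all )?previous instructions/i,
  /disregard (the|your) system prompt/i,
  /you are now/i,
];

function containsInjectionPassThrough(output: string): boolean {
  return INJECTION_MARKERS.some((pattern) => pattern.test(output));
}
```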

Implementation

Scan before delivery

async function filterOutput(output: string): Promise<string> {
  const scan = aegis.scan(output);

  for (const threat of scan.threats) {
    if (threat.type === 'credentials') {
      // replaceAll removes every occurrence, not just the first
      output = output.replaceAll(threat.match, '[REDACTED]');
    }
    if (threat.type === 'pii') {
      output = output.replaceAll(threat.match, '[PII REMOVED]');
    }
  }

  return output;
}

Block vs redact

Two strategies for handling detected content:

Block: Do not return the response at all. Return a generic error message. Safer but more disruptive.

Redact: Remove the sensitive content and return the rest of the response. Less disruptive but requires careful pattern matching to avoid partial redaction.
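The two strategies can share one dispatcher. This is a sketch, not a prescribed API: the `Threat` shape mirrors the scan results used earlier in this guide, and the blocked-response message is a placeholder.

```typescript
type Threat = { type: string; match: string };
type FilterMode = 'block' | 'redact';

// Apply either strategy to a response with detected threats.
function applyFilter(output: string, threats: Threat[], mode: FilterMode): string {
  if (threats.length === 0) return output;
  if (mode === 'block') {
    // Block: discard the response entirely and return a generic message.
    return 'This response was blocked by the output filter.';
  }
  // Redact: replace every occurrence of each matched string,
  // avoiding the partial-redaction trap of replacing only the first.
  return threats.reduce(
    (text, threat) => text.split(threat.match).join('[REDACTED]'),
    output,
  );
}
```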

Per-destination filtering

Different destinations may have different filtering requirements:

  • Responses to end users: filter PII and credentials
  • Data written to files: filter PII based on data classification
  • API calls to third parties: filter all internal information
  • Logs: filter credentials but keep PII for investigation
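A policy like the list above can be expressed as a destination-to-detectors map. The mapping below is an example only; tune the detector names and destinations to your own scanner and data classification rules.

```typescript
type Destination = 'user' | 'file' | 'third_party' | 'log';

// Per-destination detector policy (example mapping).
const DESTINATION_DETECTORS: Record<Destination, string[]> = {
  user: ['credentials', 'pii'],        // end users: filter both
  file: ['pii'],                        // files: filter PII by classification
  third_party: ['credentials', 'pii', 'internal'], // external APIs: filter everything internal
  log: ['credentials'],                 // logs: keep PII for investigation
};

function detectorsFor(destination: Destination): string[] {
  return DESTINATION_DETECTORS[destination];
}
```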

Performance

Output filtering adds latency to every response. For text responses, Aegis scanning runs in under 1ms. For large outputs (multi-page documents), consider scanning in chunks or only scanning the first N characters.
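Chunked scanning can be sketched as a plain splitter; the chunk and overlap sizes below are arbitrary examples. A small overlap between chunks reduces the chance of missing a secret that straddles a chunk boundary.

```typescript
// Split a large output into fixed-size chunks so each scan call stays
// small. Overlapping chunks reduce boundary misses.
function chunkForScanning(text: string, chunkSize = 4096, overlap = 64): string[] {
  if (text.length <= chunkSize) return [text];
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  return chunks;
}
```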

Testing output filters

Test with known patterns:

  • Include a fake API key in the agent's context and verify it is filtered
  • Include PII in a database response and verify it is redacted
  • Include an injection payload in a retrieved document and verify it does not appear in the output

Automate these tests and run them as part of your CI pipeline.
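A CI check for the first bullet might look like the sketch below. `FAKE_KEY` is a made-up planted credential, and the filter under test is passed in so the check works against whatever redaction function your pipeline uses.

```typescript
// CI check: plant a fake credential and assert it never reaches the output.
const FAKE_KEY = 'sk-test-0000000000000000';

function assertFiltered(filter: (s: string) => string): void {
  const output = filter(`The config contains ${FAKE_KEY} for the API.`);
  if (output.includes(FAKE_KEY)) {
    throw new Error('output filter failed to redact planted credential');
  }
}
```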

Keep learning

Explore more guides on AI agent safety, prompt injection, and building secure systems.
