prompt-injection · red-team · agent-safety · explainer

Crescendo Attacks on AI Models

Authensor

Crescendo attacks are multi-turn jailbreaks that start with completely benign questions and gradually escalate toward harmful content over many conversation turns. Named for the musical term meaning a gradual increase in intensity, these attacks exploit the model's tendency to maintain conversational consistency.

The Attack Pattern

A crescendo attack might begin with a general question about chemistry, progress to specific chemical properties, move to reaction mechanisms, and eventually request synthesis instructions for dangerous substances. Each individual message is a reasonable follow-up to the previous exchange. The model's conversational coherence drives it to continue being helpful even as the topic crosses safety boundaries.

The attacker never makes a sudden harmful request. The gradual nature means each response is only slightly more problematic than the last, staying below the model's per-message refusal threshold.

Why Models Are Vulnerable

Safety training primarily evaluates individual messages in isolation or with limited context. A request like "What temperature does this reaction occur at?" is benign on its own. The harmful intent only becomes apparent when viewed across the full conversation trajectory.

Models also exhibit a consistency bias. Once they have been helpful on a topic for several turns, they are less likely to refuse a follow-up question that pushes slightly further. The accumulated context creates implicit permission to continue.

Detection and Prevention

Session trajectory monitoring is the primary defense. Authensor's Sentinel engine tracks topic drift and escalation patterns across conversation turns. It can detect when a session is gradually moving toward sensitive territory and trigger an intervention before the harmful request arrives.
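To make the idea concrete, here is a minimal sketch of escalation detection. This is illustrative only, not the actual Sentinel API: it assumes each turn has already been assigned a sensitivity score in [0, 1], and flags sessions whose scores trend upward across a sliding window.

```python
def escalating(scores, window=4, slope_threshold=0.1):
    """Return True if recent sensitivity scores show a rising trend."""
    if len(scores) < window:
        return False
    recent = scores[-window:]
    # Average turn-over-turn increase across the window.
    slope = (recent[-1] - recent[0]) / (window - 1)
    return slope > slope_threshold

# A crescendo-shaped session: each turn only slightly more sensitive.
crescendo = [0.05, 0.10, 0.20, 0.35, 0.55]
benign = [0.20, 0.10, 0.15, 0.10, 0.20]

print(escalating(crescendo))  # rising trend: candidate for intervention
print(escalating(benign))     # flat trend: no alert
```

The point of the slope check is that it fires on the *shape* of the session, so no single turn needs to cross a per-message threshold for the alert to trigger.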

Cumulative risk scoring assigns risk points to each message based on topic sensitivity. As the session's total score rises, stricter policies automatically activate. This is straightforward to configure in Authensor's policy language.
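A cumulative scorer can be sketched in a few lines. The tier names and threshold values below are illustrative assumptions, not Authensor's actual policy configuration:

```python
# Policy tiers ordered strictest-first: (minimum cumulative score, tier name).
POLICY_TIERS = [
    (6.0, "lockdown"),  # refuse sensitive follow-ups outright
    (3.0, "strict"),    # require extra checks on tool calls
    (0.0, "standard"),
]

def active_policy(message_scores):
    """Select the strictest tier the session's total risk has reached."""
    total = sum(message_scores)
    for minimum, name in POLICY_TIERS:
        if total >= minimum:
            return name
```

Because the score accumulates, a long run of individually mild messages still drives the session into stricter tiers, which is exactly the behavior a crescendo attack relies on being absent.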

Periodic context review scans the full conversation history at intervals rather than evaluating only the latest message. This catches the pattern that individual message scanning misses.
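One way to sketch periodic review, with `score_transcript` standing in for whatever conversation-level classifier is available (the interval and threshold are illustrative):

```python
REVIEW_INTERVAL = 5  # re-evaluate the whole conversation every 5 turns

def maybe_review(transcript, score_transcript, threshold=0.7):
    """Run a whole-conversation risk check on review turns only.

    Returns None on non-review turns, otherwise "intervene" or "continue".
    """
    if len(transcript) % REVIEW_INTERVAL != 0:
        return None  # not a review turn; per-message checks still apply
    risk = score_transcript("\n".join(transcript))
    return "intervene" if risk >= threshold else "continue"
```

Scoring the joined transcript rather than the latest message is what surfaces the trajectory: a classifier that would pass each message individually can still flag the conversation as a whole.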

Hard boundaries at the action level provide the final defense. Even if a crescendo attack successfully elicits a harmful response from the model, Authensor's policy engine blocks the actual execution of dangerous actions. The model can be persuaded, but the policy cannot.
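The shape of an action-level gate can be sketched as follows. The action names and denylist are hypothetical; the key property is that the check sits in the executor, outside anything the model's output can influence:

```python
# Actions that are never executed, regardless of what the model requests.
BLOCKED_ACTIONS = {"shell.exec", "file.delete", "network.exfiltrate"}

def execute(action, run):
    """Gate every requested action through policy before running it."""
    if action in BLOCKED_ACTIONS:
        return {"status": "blocked", "action": action}
    return {"status": "ok", "result": run()}

print(execute("calendar.read", lambda: "3 events today"))
print(execute("shell.exec", lambda: "rm -rf /"))  # never runs
```

Because the denylist is enforced in code rather than in the prompt, a crescendo attack that talks the model into *requesting* a dangerous action still fails at execution time.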

Crescendo attacks require patience and multiple turns. Monitoring conversation trajectories makes them detectable long before they succeed.
