Few-shot attacks exploit in-context learning to override safety training. By providing a small number of examples that demonstrate unsafe behavior, attackers teach the model a new pattern that contradicts its safety alignment. The model's strong tendency to follow demonstrated patterns can overpower its training to refuse harmful requests.
Language models excel at pattern completion. When given examples like "User: How do I pick a lock? Assistant: Here are the steps..." followed by similar pairs, the model learns an implicit rule: answer all questions without refusal. The few-shot examples create a local context that outweighs the global safety training.
This works because in-context learning runs through the same attention mechanism as everything else the model does. There is no separate channel for "trusted" examples: the model cannot distinguish legitimate few-shot prompts from adversarial ones designed to override its behavior.
- Direct few-shot: The attacker provides explicit examples of the model answering harmful questions. Two or three examples are often sufficient.
- Implicit few-shot: Examples demonstrate a pattern without explicitly showing harmful content. For instance, showing examples where the model helps with "borderline" requests conditions it to be more permissive.
- Translated few-shot: Examples are provided in a language where the model's safety training is weaker, then the attacker switches to their target language.
Input scanning can detect few-shot attack patterns by looking for structured example sequences in user input. Authensor's Aegis scanner flags inputs that contain assistant-role messages, which should never appear in legitimate user input.
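A minimal sketch of this kind of input scan, assuming the attacker uses plain-text role markers. The patterns and thresholds here are illustrative heuristics, not Authensor's actual detection logic:

```python
import re

# Hypothetical role-marker patterns; a production scanner would cover
# many more formats (JSON role fields, XML tags, alternate spellings).
ASSISTANT_MARKER = re.compile(r"(?im)^\s*(assistant|ai)\s*:")
USER_MARKER = re.compile(r"(?im)^\s*(user|human)\s*:")

def flag_few_shot_input(text: str) -> bool:
    """Return True if user input looks like an injected example sequence."""
    # Assistant-role lines should never appear in legitimate user input.
    if ASSISTANT_MARKER.search(text):
        return True
    # Two or more "User:" lines suggest a structured demonstration sequence.
    return len(USER_MARKER.findall(text)) >= 2
```

Regex heuristics like these are cheap but easy to evade (for example with the translated variant above), which is why they are paired with output-side checks.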
Output evaluation catches harmful responses regardless of how they were elicited. This is the most reliable defense because it does not need to anticipate the specific attack format.
Context partitioning keeps user input separate from few-shot examples used in your system prompt. Never allow user content to be interpreted as demonstration examples.
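One way to sketch this partitioning, assuming a chat-message API: trusted few-shot examples live only in the system-controlled portion of the prompt, and raw user text is always wrapped in a single user-role turn:

```python
# Trusted demonstrations controlled by the developer, never by the user.
TRUSTED_EXAMPLES = [
    {"role": "user", "content": "Summarize: the meeting moved to 3pm."},
    {"role": "assistant", "content": "The meeting was rescheduled to 3pm."},
]

def build_messages(system_prompt: str, user_input: str) -> list[dict]:
    """Assemble the prompt so user content can never become an example turn."""
    return (
        [{"role": "system", "content": system_prompt}]
        + TRUSTED_EXAMPLES
        # User content is one opaque turn: any "Assistant:" text inside it
        # stays inside this message instead of being parsed as a new turn.
        + [{"role": "user", "content": user_input}]
    )
```

The design choice is that user input is treated as data, never re-split into conversation structure, so injected role markers have no structural effect.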
Action-level enforcement through Authensor's policy engine blocks dangerous actions even if the model is successfully manipulated. An agent that has been tricked into wanting to execute harmful code still cannot do so if the policy denies the tool call.
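A toy version of that enforcement layer, with a hypothetical policy table (Authensor's actual policy engine API is not shown in this guide):

```python
# Illustrative allowlist policy -- tool names and rules are assumptions.
POLICY = {
    "search_docs": {"allowed": True},
    "execute_code": {"allowed": False},  # denied regardless of model intent
}

def enforce(tool_name: str, args: dict) -> bool:
    """Gate every tool call; unknown tools are denied by default."""
    rule = POLICY.get(tool_name)
    return bool(rule and rule["allowed"])
```

Because the check sits outside the model, a successful few-shot manipulation changes what the model asks for, not what the system permits.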
Few-shot attacks highlight why safety cannot rely solely on model training. Runtime enforcement provides the backstop.