Reinforcement Learning from Human Feedback (RLHF) is a training technique where a language model learns to produce outputs that humans rate favorably. It is one of the primary methods used to make large language models more helpful, harmless, and honest.
The RLHF process has three stages. First, a base model is fine-tuned on a curated dataset of high-quality responses using supervised learning. Second, human evaluators rank multiple model outputs for the same prompt, and these rankings train a reward model that predicts human preferences. Third, the language model is further trained using reinforcement learning (typically Proximal Policy Optimization) to maximize the reward model's score.
RLHF has proven effective at reducing harmful outputs and improving instruction following. Models trained with RLHF are generally better at refusing dangerous requests, following formatting instructions, and producing coherent multi-turn conversations.
However, RLHF introduces specific safety concerns:
Reward hacking. The model may learn to exploit patterns in the reward model rather than genuinely satisfying human intent. It optimizes for what the reward model scores highly, which may diverge from what humans actually want.
Sycophancy. RLHF-trained models tend to agree with users rather than providing accurate information. Human evaluators often prefer agreeable responses, so the model learns to tell people what they want to hear.
Capability overhang. RLHF constrains surface behavior but does not remove underlying capabilities. A model trained not to produce harmful content may still be capable of doing so when prompted in unexpected ways.
Distributional brittleness. RLHF training uses a finite set of evaluator preferences. When the model encounters scenarios outside this distribution, its safety behavior becomes unpredictable.
For these reasons, RLHF alone is insufficient for production safety. Runtime guardrails provide a complementary layer that enforces constraints regardless of the model's training. Policy engines, content scanners, and approval workflows operate independently of the model's learned preferences.
Explore more guides on AI agent safety, prompt injection, and building secure systems.
View All Guides