Reinforcement Learning from Human Feedback
Reinforcement Learning from Human Feedback (RLHF) is a technique for aligning AI models with human preferences. The process works in two stages. First, human evaluators rank model outputs by quality, and a reward model is trained to predict which outputs humans prefer. Second, the base model is fine-tuned with reinforcement learning to maximize the learned reward signal. OpenAI popularized the approach with InstructGPT (Ouyang et al., 2022), and RLHF is now central to major language models such as ChatGPT, Claude, and Gemini. The technique matters because it shapes model behavior without manually specified rules, but whoever designs the feedback process also shapes the biases in the output.
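The first stage above can be sketched in a few lines. The toy example below trains a linear reward model on a single human preference pair using the standard Bradley-Terry pairwise loss, -log σ(r(chosen) - r(rejected)); the feature vectors, learning rate, and linear model are illustrative stand-ins (a real reward model is a neural network over text), not the implementation used by any particular system.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical feature vectors for two outputs to the same prompt;
# a human labeler preferred `chosen` over `rejected`.
chosen = [0.9, 0.2, 0.5]
rejected = [0.1, 0.7, 0.3]

# Toy linear reward model r(x) = w . x, standing in for the scalar
# reward head of a language model.
w = [0.0, 0.0, 0.0]
lr = 0.5

for _ in range(100):
    # Bradley-Terry pairwise loss: -log sigmoid(r(chosen) - r(rejected)).
    margin = dot(w, chosen) - dot(w, rejected)
    # Gradient of that loss with respect to w, applied by gradient descent.
    grad_scale = -(1.0 - sigmoid(margin))
    w = [wi - lr * grad_scale * (c - r)
         for wi, c, r in zip(w, chosen, rejected)]

# The trained reward model now scores the preferred output higher.
print(dot(w, chosen) > dot(w, rejected))  # → True
```

In the second stage, this learned scalar reward replaces a hand-written objective: the policy model is updated (typically with PPO) to produce outputs that score highly under it.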