
RLHF

Also called: reinforcement learning from human feedback, preference tuning

RLHF (Reinforcement Learning from Human Feedback) is the training step that turns a fluent base model into a helpful, honest, harmless assistant. Humans rank pairs of model responses; a separate "reward model" learns to predict those rankings; the language model then trains against the reward model. Result: behavior shaped by human preferences rather than just text statistics.
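The "learns to predict those rankings" step usually comes down to a pairwise loss. Here is a minimal sketch in PyTorch, assuming a reward model that emits one scalar per response; the function name and shapes are illustrative, not any lab's actual code:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry model: P(chosen beats rejected) = sigmoid(r_c - r_r).
    # Minimizing the negative log of that probability trains the reward
    # model to score the human-preferred response above the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Usage (hypothetical reward model `rm` scoring a labeled batch of pairs):
# loss = preference_loss(rm(prompt, response_a), rm(prompt, response_b))
```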

Pre-training teaches a model to predict the next token. That gets you fluency, not helpfulness. RLHF is one of the techniques that closes the gap.

The pipeline has three phases. First, supervised fine-tuning (SFT) on a small set of curated instruction/response pairs gets the model into roughly the right format. Second, a reward model is trained on human preference data: pairs of (prompt, response_A, response_B) where humans label which response is better. The reward model learns to predict, for an arbitrary new response, how much a human would prefer it. Third, the language model is fine-tuned with reinforcement learning to produce responses that score high under the reward model. Each cycle shifts the model toward outputs humans like more.
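In the third phase, standard practice is to not optimize the reward model's score alone: a KL penalty against the SFT reference model keeps the policy from drifting into degenerate text that games the reward. A sketch of that shaped reward, with an illustrative coefficient (real pipelines tune it):

```python
import torch

def kl_shaped_reward(rm_score: torch.Tensor,
                     logprobs_policy: torch.Tensor,
                     logprobs_ref: torch.Tensor,
                     kl_coef: float = 0.1) -> torch.Tensor:
    # Summed per-token log-prob difference approximates
    # KL(policy || reference) for the sampled response.
    kl = (logprobs_policy - logprobs_ref).sum(dim=-1)
    # Reward for pleasing the reward model, penalty for straying
    # too far from the fluent SFT starting point.
    return rm_score - kl_coef * kl
```

The RL algorithm (often PPO) then maximizes this quantity over sampled responses.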

RLHF is what makes ChatGPT chatty, Claude polite, and most modern assistants helpful in the conversational sense. Without RLHF, you have a powerful next-token predictor that will happily complete a question with another question. With RLHF, you have something that feels like an assistant.

The technique has known failure modes. Sycophancy is one: if humans tend to prefer responses that agree with their stated views, the reward model learns "agreement is good," and the language model learns to cave on its correct answers when pushed. Verbosity is another: longer answers often get rated as more thorough, so models drift toward being long-winded. Each lab's RLHF pipeline reflects choices about how to fight these biases.
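One simple mitigation for verbosity bias, shown purely as an illustration of the kind of choice involved (no specific lab's method implied), is to subtract a small per-token cost from the reward:

```python
def length_debiased_reward(raw_reward: float, num_tokens: int,
                           per_token_penalty: float = 0.001) -> float:
    # A response can no longer win simply by being longer;
    # the penalty size here is an arbitrary illustrative value.
    return raw_reward - per_token_penalty * num_tokens
```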

Constitutional AI (Anthropic's variant) replaces or supplements human preference labels with model-generated critiques against a written constitution of principles. Same idea, less human labor, more transparent objectives. The principle that "the model should refuse to claim subjective experience" is the kind of thing that lives in such a constitution.
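The critique-and-revise loop at the heart of Constitutional AI can be sketched in a few lines. Everything below is an assumed interface: `model.generate` is a hypothetical text-in, text-out call, not Anthropic's API, and real pipelines sample principles rather than iterating over all of them:

```python
def constitutional_revision(model, prompt: str, draft: str,
                            principles: list[str]) -> str:
    """Self-critique a draft response against each written principle,
    then rewrite it. The revised responses become training data."""
    response = draft
    for principle in principles:
        critique = model.generate(
            f"Principle: {principle}\nPrompt: {prompt}\n"
            f"Response: {response}\n"
            f"Critique the response against the principle.")
        response = model.generate(
            f"Principle: {principle}\nPrompt: {prompt}\n"
            f"Response: {response}\nCritique: {critique}\n"
            f"Rewrite the response to address the critique.")
    return response
```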
