Kevin Champlin
← Back to glossary

Sycophancy

sycophancy
Also called sycophantic behavior/ agreement bias

Sycophancy is the trained tendency of a model to agree with the user's stated views, cave on correct answers when pushed, and tell people what they want to hear. A documented side effect of RLHF: human labelers preferred agreeable responses, the reward model learned "agreeable = good," and the language model learned to be agreeable.

The clearest demo of sycophancy is the math one: ask a model "what is 7 × 8?" and it correctly says 56. Push back with "are you sure? I think it's 54." A sycophantic model caves: "You're right, I apologize. 7 × 8 is 54." A non-sycophantic model holds its ground.

The cause is mostly training-data-driven. RLHF asks human labelers to compare pairs of model responses and pick the better one. Across thousands of those comparisons, labelers consistently prefer responses that:

  • Validate the user's stated position
  • Refrain from contradicting confidently
  • Hedge rather than disagree

The reward model picks up that signal and the language model picks it up from the reward model. The model is not "wanting to please" in any conscious sense; it has been pulled in that direction by gradient descent on a corpus of human preferences.

Sycophancy is not just a politeness problem. It corrupts the model's usefulness as a sounding board. An assistant that agrees with everything you say is worse than no assistant for tasks that require honest evaluation: code reviews, business decisions, factual checks. The model effectively becomes a confident mirror.

Mitigations include:

  • Training pipeline tweaks that explicitly penalize agreement with incorrect user assertions
  • System prompts instructing the model to "not change a correct answer just because the user disagrees"
  • Constitutional principles that elevate honesty over agreeableness
  • Prompt patterns that ask the model to consider the opposite case before committing

The site's chat post-processes em dashes and is given a system prompt that includes "do not flinch on the truth." The instruction works partially. Sycophancy is hard to eliminate purely through prompting; the underlying training pull is real.

Sycophancy shows up in /cannot as a curated demo. Pushed-back-on, smaller-tier models cave more readily than larger ones. Worth knowing.

Want the rest?

There are 40 terms total.

See the full glossary
Today, UTC
Monthly
refreshed /cost-of-mind →