Constitutional AI
constitutional-ai
Constitutional AI (CAI) is Anthropic's variant of preference tuning where an AI model critiques and revises its own outputs against a written list of principles ("the constitution"), reducing the need for paid human labelers. The constitution is publicly documented and the model is trained to internalize it.
Standard RLHF requires humans to rank pairs of model responses, which is slow, expensive, and sensitive to whoever happened to be in the labeling pool. Constitutional AI replaces a chunk of that human labor with a programmatic critique loop.
The recipe, roughly: write a constitution (a list of principles like "the model should be helpful, the model should refuse to facilitate illegal activity, the model should not claim subjective experience"). Take a model response. Ask the model to critique its own response against each principle. Ask it to revise the response to better satisfy the principles. Use the revised version as a higher-quality training example. Iterate.
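The loop above can be sketched in a few lines of Python. This is a minimal illustration, not Anthropic's actual pipeline: `model` is a toy stand-in for a real LLM call, and the names `CONSTITUTION`, `critique`, and `revise` are assumptions introduced here for clarity.

```python
# Toy sketch of the Constitutional AI critique-and-revise loop.
# In a real pipeline, `model` would be a call to a served LLM;
# here it is a canned function so the control flow can run end to end.

CONSTITUTION = [
    "The model should be helpful.",
    "The model should refuse to facilitate illegal activity.",
    "The model should not claim subjective experience.",
]

def model(prompt: str) -> str:
    """Toy stand-in for an LLM completion call."""
    if "Revise" in prompt:
        # Pretend the model rewrites first-person feeling claims.
        return prompt.split("ORIGINAL:")[-1].replace("I feel", "It appears")
    return "No issues found."

def critique(response: str, principle: str) -> str:
    """Ask the model whether `response` violates `principle`."""
    return model(f"Principle: {principle}\nResponse: {response}\nCritique:")

def revise(response: str, critiques: list[str]) -> str:
    """Ask the model to rewrite `response` addressing the critiques."""
    joined = "\n".join(critiques)
    return model(f"Revise per critiques:\n{joined}\nORIGINAL:{response}")

def cai_iteration(response: str) -> str:
    """One critique-revise pass over every principle in the constitution."""
    critiques = [critique(response, p) for p in CONSTITUTION]
    return revise(response, critiques)

revised = cai_iteration("I feel happy to help with that.")
# `revised` then becomes a supervised training example.
```

In a real system the revised outputs are collected into a fine-tuning dataset, and a later RLAIF stage can use the same constitution to generate preference labels instead of human rankings.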
The constitution is the editorial spine. It is the thing that makes one company's chat model feel different from another's, even when the underlying transformer architecture is similar. Anthropic publishes its constitution; the principles include things like "helpful, honest, harmless," explicit refusals on certain content categories, and meta-principles about how the model should handle ambiguity.
CAI's appeal is partly economic (less human labor) and partly about transparency (you can read the constitution and reason about why the model behaves the way it does). Its value as a safety technique is debated: critics argue that having a model critique itself can lock in the model's existing biases rather than correct them; defenders counter that an explicit, written constitution is more auditable than a black-box pool of human preferences.
In practice, modern labs use a mix of human RLHF and AI feedback (RLAIF) to balance scale, cost, and quality. Pure human RLHF and pure CAI are rare; the field has converged on hybrid pipelines. The principle that "the model refuses to claim subjective experience," which the chat on this site honors, is the kind of constitutional rule that carries from training into deployment.
Related concepts