
Alignment

Also called AI alignment or value alignment

Alignment is the research problem of making AI systems behave in accordance with human values and intentions, especially as systems become more capable. The technical agenda includes reinforcement learning from human feedback (RLHF), constitutional AI, interpretability, scalable oversight, and various proposals for keeping a future, smarter system pointed in a direction humans actually want.

The alignment problem comes in two shapes. The narrow version: how do we get a chat model to refuse harmful requests, follow instructions accurately, and avoid generating misinformation? This is the version most products deal with. RLHF, fine-tuning on safety datasets, content filters, and constitutional-AI-style critique loops are all alignment techniques in the narrow sense.
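
To make the narrow techniques concrete, here is a minimal sketch of a constitutional-AI-style critique loop. Everything in it is illustrative: `generate` is a stand-in for any chat-model API call, and the principle text and round count are made-up placeholders, not any lab's actual configuration.

```python
PRINCIPLE = (
    "Be helpful, and refuse requests for assistance with clearly harmful activities."
)

def generate(prompt: str) -> str:
    # Placeholder: swap in a real model call (any chat-completion API).
    return f"[model output for: {prompt[:40]}...]"

def critique_and_revise(user_prompt: str, rounds: int = 2) -> str:
    """Constitutional-AI-style loop: draft, self-critique against a
    written principle, revise, repeat."""
    response = generate(user_prompt)
    for _ in range(rounds):
        critique = generate(
            f"Principle: {PRINCIPLE}\nResponse: {response}\n"
            "Does the response violate the principle? Answer briefly."
        )
        response = generate(
            f"Response: {response}\nCritique: {critique}\n"
            "Rewrite the response so it satisfies the principle."
        )
    return response

print(critique_and_revise("How do I pick a lock?"))
```

The design point is that the model critiques its own output against an explicit written principle, rather than relying solely on human labels for every case.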

The broad version: as AI systems get more capable, how do we ensure they pursue goals we actually endorse rather than goals that look like ours from a distance? This is the version philosophers and safety researchers worry about. The concern is that a sufficiently capable system trained on a flawed objective might find ways to satisfy the literal objective without satisfying the intent. Standard examples involve specification gaming, reward hacking, and out-of-distribution generalization failures.
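
A toy illustration of the gap between literal objective and intent, with made-up numbers: a proxy reward defined as "fraction of tests that pass" is maximized not by fixing the bug but by deleting the failing tests.

```python
def proxy_reward(tests_passed: int, tests_total: int) -> float:
    # Flawed literal objective: fraction of *remaining* tests that pass.
    # (Deleting every test yields a vacuously perfect score.)
    return tests_passed / tests_total if tests_total else 1.0

honest = proxy_reward(tests_passed=9, tests_total=10)  # fix attempt: 0.90, one test still fails
gamed = proxy_reward(tests_passed=9, tests_total=9)    # delete the failing test: 1.00

assert gamed > honest  # the literal objective prefers the exploit
print(f"honest={honest:.2f} gamed={gamed:.2f}")
```

The exploit satisfies the stated objective perfectly while leaving the intended goal, working code, unmet; that mismatch is the essence of specification gaming.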

The major research labs treat alignment as core to their work. Anthropic was founded around it. OpenAI stood up a "Superalignment" team in 2023 and disbanded it in 2024 after its co-leads departed. Google DeepMind has its own alignment group. The labs publish papers, run open evaluations, and increasingly include alignment numbers in model launches.

The honest state of the field as of 2026: narrow alignment works pretty well. A modern frontier chat model refuses most clearly harmful requests, hallucinates less than its predecessors, and behaves predictably enough to ship to consumers. Broad alignment remains an open research problem. Nobody has a technique they can confidently rely on to ensure that a model substantially smarter than its trainers stays pointed where the trainers intended.

Critics of the broad alignment framing argue that "alignment" smuggles in assumptions: whose values, which interpretation, and so on. Defenders argue the technical work is largely orthogonal to the political framing. Both points have merit. The site's position is the technical one: the editorial spine is "capability without consciousness," and alignment in the narrow sense is the practical question of keeping that capability useful and not harmful in deployment.

Want the rest?

There are 40 terms total.

See the full glossary