Calibration
Calibration is the degree to which a model's expressed confidence matches its actual accuracy. When a perfectly calibrated model says "I'm 80% sure," it is right 80% of the time. Modern frontier models are imperfectly calibrated: they tend to be overconfident on hard questions and underconfident when they actually do know the answer.
Calibration is a different axis from accuracy. A model can be 90% accurate but badly calibrated (it says "I'm 99% sure" on the 10% it gets wrong). A model can be 70% accurate but well calibrated (it knows when it might be wrong and says so). For a real-world AI product, calibration often matters more than raw accuracy because users rely on the model's uncertainty signals to decide whether to act on the answer.
The classic calibration plot (a reliability diagram) bins predictions by stated confidence and plots actual accuracy in each bin. A perfectly calibrated model would produce points on the y=x line. Most frontier language models fall below the line at high confidence (stated confidence exceeds accuracy) and above it at low confidence (they hedge more than the data warrants).
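The binning step behind such a plot is simple to sketch. Here is a minimal version, assuming you have arrays of stated confidences in [0, 1] and 0/1 correctness outcomes (the function name and bin count are illustrative, not from any particular library):

```python
import numpy as np

def reliability_bins(confidences, correct, n_bins=10):
    """Bin predictions by stated confidence; report accuracy per bin.

    Returns (mean_confidence, accuracy, count) for each non-empty bin.
    Plotting accuracy against mean confidence gives the calibration curve.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to a bin; clip so conf == 1.0 lands in the last bin.
    idx = np.clip(np.digitize(confidences, edges[1:-1]), 0, n_bins - 1)
    conf_means, accuracies, counts = [], [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            conf_means.append(confidences[mask].mean())
            accuracies.append(correct[mask].mean())
            counts.append(int(mask.sum()))
    return np.array(conf_means), np.array(accuracies), np.array(counts)
```

A bin whose accuracy sits well below its mean confidence is exactly the overconfident region described above.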
The hallucination problem is partly a calibration problem. When a model invents a citation, it is not just generating wrong text; it is generating wrong text in a tone that conveys high confidence. A well-calibrated model in the same situation would say "I'm not sure, but I think it might be...". RLHF tends to flatten that signal: pleasant assistants do not hedge as much as they should.
Calibration can be improved through training (fine-tuning the model to express uncertainty when it should), through prompting (instructing the model to estimate its own confidence and to refuse when low), and through external scaffolding (running multiple samples and looking at agreement). None of these are perfect.
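The scaffolding approach mentioned above can be sketched in a few lines: sample the model several times at nonzero temperature and use agreement between samples as a confidence proxy (self-consistency). The `ask_model` callable here is a hypothetical stand-in for an actual LLM API call:

```python
from collections import Counter

def agreement_confidence(ask_model, prompt, n_samples=8):
    """Estimate confidence via agreement across repeated samples.

    ask_model: any callable prompt -> answer string. In practice this
    would hit an LLM API with temperature > 0 (hypothetical here).
    Returns the majority answer and the fraction of samples agreeing with it.
    """
    answers = [ask_model(prompt) for _ in range(n_samples)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n_samples
```

The agreement fraction is not a calibrated probability by itself, but it is a useful relative signal: answers the model produces consistently tend to be ones it actually knows.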
For evaluation purposes, expected calibration error (ECE) is the standard metric: the count-weighted average, over confidence bins, of the absolute gap between the model's stated confidence and its actual accuracy in that bin. Lower is better. Modern models report ECE values that are not great by classical ML standards, but they are slowly improving.
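The definition translates directly into code. A minimal sketch, using the same binning scheme as a reliability diagram (bin count and function name are illustrative):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: count-weighted mean of |accuracy - confidence| across bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(confidences, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += (mask.sum() / len(confidences)) * gap
    return ece
```

For example, a model that says "90% sure" on four questions but gets only two right has a gap of 0.4 in that bin, and an ECE of 0.4 if those are its only predictions.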
The site's editorial promise is partly a calibration argument: "watch the meter" means watch the cost, but it also means watch the model hedge or commit, and learn when to trust each.
Related concepts