
Calibration

Also called Probability Calibration

The property of a predictive model that its stated probabilities match observed frequencies in the long run. The single most important quality property for any betting model.

Definition

A model is calibrated if, among all events to which it assigned probability p, the observed frequency of those events occurring is also p. Calibration is distinct from accuracy: a model can be highly accurate (often correct) but poorly calibrated (overconfident or underconfident in its probabilities), and it can be perfectly calibrated while having modest accuracy.
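
In symbols (notation ours, not the glossary's): writing Y for the 0/1 outcome indicator and p-hat for the model's stated probability, calibration requires

\[
\Pr\left(Y = 1 \mid \hat{p} = p\right) = p \qquad \text{for all } p \in [0,1].
\]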

Why It Matters

For betting and decision-making under uncertainty, calibration is the dominant quality criterion because every downstream decision (Kelly sizing, edge detection, expected value) depends on the probability being a faithful representation of true frequency.

How to Compute

Bin predictions by stated probability (e.g., bins of width 0.05). For each bin, compute the average stated probability and the observed frequency. Plot the pairs; this plot is the reliability diagram, and perfect calibration lies on the 45-degree line. Quantitative summaries include the Brier score (mean squared difference between stated probability and 0/1 outcome) and expected calibration error (the count-weighted average of each bin's stated-versus-observed gap). For binary outcomes the Brier score is the standard composite metric, since it penalizes miscalibration and lack of sharpness together.
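
A minimal sketch of this computation, assuming NumPy arrays of stated probabilities and 0/1 outcomes (the function names and default bin count are ours):

```python
import numpy as np

def reliability_table(p, y, n_bins=20):
    """Bin stated probabilities and compare them to observed frequencies.

    p: stated probabilities in [0, 1]; y: binary outcomes (0/1).
    Returns (avg_stated, observed_freq, count) for each non-empty bin.
    """
    p, y = np.asarray(p, float), np.asarray(y, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to a bin (clip so p == 1.0 lands in the last bin).
    idx = np.clip(np.digitize(p, edges) - 1, 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((p[mask].mean(), y[mask].mean(), int(mask.sum())))
    return rows

def ece(p, y, n_bins=20):
    """Expected calibration error: count-weighted average |stated - observed|."""
    rows = reliability_table(p, y, n_bins)
    n = sum(c for _, _, c in rows)
    return sum(c * abs(stated - obs) for stated, obs, c in rows) / n

def brier(p, y):
    """Brier score: mean squared difference between probability and outcome."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    return float(np.mean((p - y) ** 2))
```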

Example

Take 1,000 model predictions and look at the bin of stated probabilities between 0.6 and 0.7, where the average stated probability is 0.65. If the observed frequency in that bin is 0.50, the model is overconfident in this region by 15 percentage points. A reliability diagram makes the deviation visible across all bins at once.
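
A toy reproduction of this scenario with synthetic data (the generator, seed, and names are ours): stated probabilities are drawn in the 0.6 to 0.7 bin while outcomes occur only 50% of the time.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1,000 predictions landing in the 0.6-0.7 bin, for events that occur 50% of the time.
stated = rng.uniform(0.60, 0.70, size=1_000)
outcomes = rng.random(1_000) < 0.50

gap = stated.mean() - outcomes.mean()
print(f"avg stated {stated.mean():.2f}, observed {outcomes.mean():.2f}, gap {gap:+.2f}")  # ~ +0.15
```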

Common Mistakes

Conflating accuracy with calibration and assuming a frequently correct model must also be trustworthy on probabilities. Judging calibration from sparsely populated bins, where the stated-versus-observed gap is mostly noise. Fitting a post-hoc calibration transform on training data rather than a held-out calibration set, which defeats its purpose.

Frequently Asked

What is the difference between accuracy and calibration?

Accuracy measures how often the model's top-1 prediction is correct. Calibration measures whether the probabilities the model assigns match the actual frequencies of outcomes. A model that always says '60% chance of A' for an event that occurs 50% of the time is right about half the time however you grade its top-1 pick, but it is miscalibrated by a steady 10 percentage points. For betting, calibration is dominant because it controls every downstream EV calculation.
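
A quick numeric illustration of the split (synthetic data, names ours): top-1 accuracy sits near 50% regardless of grading, while the stated-versus-observed gap is a constant 10 points.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.random(100_000) < 0.5        # the event actually occurs 50% of the time
p = np.full(y.shape, 0.6)            # the model always states 60%

accuracy = ((p > 0.5) == y).mean()   # grade the top-1 pick: correct ~50% of the time
print(f"top-1 accuracy {accuracy:.2f}; stated {p.mean():.2f} vs observed {y.mean():.2f}")
```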

How is calibration improved if a model is found to be miscalibrated?

Post-hoc calibration methods like Platt scaling (logistic regression on the model output) or isotonic regression learn a monotonic transform from raw model probabilities to calibrated probabilities, fit on a held-out calibration set. This is standard practice in production ML systems for sports prediction. The transform is then applied to all live predictions.
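
A minimal sketch of both transforms using scikit-learn, fit on a synthetic held-out calibration set (the synthetic miscalibration and all names are ours; this is an illustration, not the production pipeline described above):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Synthetic held-out calibration set: raw model probabilities are
# systematically pushed toward the extremes (overconfident).
true_p = rng.uniform(0.05, 0.95, size=5_000)
y = rng.random(true_p.shape) < true_p
raw = np.clip(true_p + 0.6 * (true_p - 0.5), 0.01, 0.99)

# Platt scaling: a logistic regression on the raw output (here, its log-odds).
log_odds = np.log(raw / (1.0 - raw)).reshape(-1, 1)
platt = LogisticRegression().fit(log_odds, y)
platt_probs = platt.predict_proba(log_odds)[:, 1]

# Isotonic regression: a monotonic, piecewise-constant transform.
iso = IsotonicRegression(out_of_bounds="clip").fit(raw, y)
iso_probs = iso.predict(raw)
```

Either fitted transform is then applied to live raw probabilities, as the answer above notes.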

Can a model be too confident or not confident enough?

Yes, both happen. Models trained with cross-entropy loss often become overconfident, pushing probabilities toward 0 or 1, especially under class imbalance. Averaging probabilities across an ensemble can leave the combined model underconfident relative to the truth. Calibration assessment catches both.
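
A toy illustration of the ensemble effect (the setup and numbers are ours): ten members that are unbiased but noisy in log-odds space average out to probabilities less extreme than the observed frequencies.

```python
import numpy as np

rng = np.random.default_rng(3)

# Events sit near the extremes: true probability is either 0.9 or 0.1.
true_p = np.where(rng.random(10_000) < 0.5, 0.9, 0.1)
y = rng.random(true_p.shape) < true_p

# Ten ensemble members: unbiased but noisy in log-odds space.
log_odds = np.log(true_p / (1.0 - true_p))
members = 1.0 / (1.0 + np.exp(-(log_odds + rng.normal(0.0, 1.5, size=(10, true_p.size)))))
avg = members.mean(axis=0)

hi = true_p == 0.9
print(f"stated {avg[hi].mean():.2f} vs observed {y[hi].mean():.2f}")  # stated < observed
```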

Why is calibration central to HVP?

Because every win-rate or ROI claim implicitly assumes the model's probability outputs correspond to real frequencies. If they do not, headline numbers are an artifact of bin selection rather than a real edge. HVP rule 2 (Beta-Binomial CI lower bound) and rule 5 (empirical CLV haircut) are both corrections for the gap between model-stated and actually-realized probabilities.
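
For concreteness, a generic sketch of a Beta-Binomial lower bound of the kind rule 2 describes (the uniform Beta(1, 1) prior, confidence level, and function name here are assumptions, not HVP's specification):

```python
from scipy.stats import beta

def win_rate_lower_bound(wins, losses, conf=0.95, prior=(1.0, 1.0)):
    """One-sided lower bound of a Beta-Binomial credible interval on the win rate."""
    a, b = prior
    return beta.ppf(1.0 - conf, a + wins, b + losses)

# Example: 60 wins in 100 bets. The headline 60% win rate shrinks to ~52%
# once sampling uncertainty is priced in.
print(round(win_rate_lower_bound(60, 40), 3))
```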

See Also

Brier score, expected calibration error, reliability diagram, Platt scaling, isotonic regression, Kelly sizing, closing line value (CLV).