Calibration
The property that a model's stated probabilities match observed frequencies in the long run. The single most important quality criterion for any betting model.
A model is calibrated if, among all events to which it assigned probability p, the observed frequency of those events occurring is also p. Calibration is distinct from accuracy: a model can be highly accurate (often correct) but poorly calibrated (overconfident or underconfident in its probabilities), and it can be perfectly calibrated while having modest accuracy. For betting and decision-making under uncertainty, calibration is the dominant quality criterion because every downstream decision (Kelly sizing, edge detection, expected value) depends on the probability being a faithful representation of true frequency.
- Kelly stake sizing amplifies probability error. A miscalibrated 70% prediction (when the true frequency is 60%) bet at full Kelly overbets so badly that expected log growth turns negative, driving the bankroll toward zero even though the model wins more often than not (see the sketch after this list).
- Edge detection is the difference between your probability and the market's. Both must be calibrated for the difference to be meaningful.
- Calibrated models compose. You can combine probabilities from a calibrated team-strength model with those from a calibrated injury-impact model. You cannot meaningfully combine miscalibrated ones.
- Calibration is testable empirically. Bin predictions by stated probability, observe actual frequency in each bin, plot the calibration curve. The diagonal is perfect; deviation is the calibration error.
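A minimal sketch of the Kelly point above, in Python. The decimal odds of 1.60 and the 70%/60% probabilities are illustrative assumptions, and `kelly_fraction` and `expected_log_growth` are hypothetical helper names, not functions from any particular library.

```python
import math

def kelly_fraction(p: float, decimal_odds: float) -> float:
    """Full-Kelly stake fraction for a binary bet at the given decimal odds."""
    b = decimal_odds - 1.0          # net payout per unit staked
    return (b * p - (1.0 - p)) / b  # (bp - q) / b

def expected_log_growth(f: float, p_true: float, decimal_odds: float) -> float:
    """Expected log bankroll growth per bet when staking fraction f,
    evaluated at the TRUE win probability."""
    b = decimal_odds - 1.0
    return p_true * math.log(1.0 + f * b) + (1.0 - p_true) * math.log(1.0 - f)

odds = 1.60                         # hypothetical decimal odds (implied probability 62.5%)
p_stated, p_true = 0.70, 0.60

f_stated = kelly_fraction(p_stated, odds)   # stake the miscalibrated model recommends
f_true = kelly_fraction(p_true, odds)       # stake the true probability would recommend

print(f"stake from stated 70%: {f_stated:.3f}")   # 0.200 of bankroll
print(f"stake from true 60%:   {f_true:.3f}")     # negative -> no bet at these odds
print(f"log growth at stated stake: {expected_log_growth(f_stated, p_true, odds):+.4f}")
```

With these illustrative numbers the stated 70% recommends staking 20% of the bankroll on a bet the true 60% says to skip entirely, and expected log growth at that stake is roughly -0.02 per bet: the model wins 60% of the time and the bankroll still decays.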
Bin predictions by stated probability (e.g., bins of width 0.05). For each bin, compute the average stated probability and the observed frequency. Plot the points; perfect calibration is a 45-degree line. Quantitative measures include the Brier score (mean squared difference between stated probability and outcome) and expected calibration error (the count-weighted average gap between stated probability and observed frequency across bins); the reliability diagram is the visual counterpart. For binary outcomes the Brier score is the standard single-number summary, rewarding both calibration and sharpness.
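A sketch of that procedure under the stated assumptions (bins of width 0.05, outcomes encoded 0/1): the function below returns the per-bin points for a reliability diagram along with the Brier score and expected calibration error. `calibration_report` is a hypothetical name, not an existing library function.

```python
import numpy as np

def calibration_report(p: np.ndarray, y: np.ndarray, bin_width: float = 0.05):
    """Bin predictions by stated probability and compare with observed frequency.

    p: stated probabilities in [0, 1]; y: binary outcomes (0/1).
    Returns per-bin (mean stated probability, observed frequency, count),
    the Brier score, and expected calibration error (count-weighted mean gap)."""
    edges = np.linspace(0.0, 1.0, int(round(1.0 / bin_width)) + 1)
    idx = np.clip(np.digitize(p, edges) - 1, 0, len(edges) - 2)

    rows, ece = [], 0.0
    for b in range(len(edges) - 1):
        mask = idx == b
        n = int(mask.sum())
        if n == 0:
            continue
        mean_p = float(p[mask].mean())   # average stated probability in this bin
        obs = float(y[mask].mean())      # observed frequency in this bin
        rows.append((mean_p, obs, n))
        ece += (n / len(p)) * abs(mean_p - obs)

    brier = float(np.mean((p - y) ** 2))  # mean squared error of the probabilities
    return rows, brier, ece
```

Usage would look like `rows, brier, ece = calibration_report(model_probs, outcomes)` on held-out predictions, where `model_probs` and `outcomes` are hypothetical arrays of stated probabilities and realized binary results.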
Take 1,000 model predictions and consider the bin with stated probability between 0.6 and 0.7. Suppose the average stated probability in that bin is 0.65. If the observed frequency in that bin is 0.50, the model is overconfident in this region by 15 percentage points. A reliability diagram makes the deviation visible across all bins at once.
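A quick way to draw that diagram, assuming matplotlib is available. The per-bin points below are made-up numbers that mirror the example (0.65 stated vs 0.50 observed in one bin), not real model output; in practice they would come from something like the `calibration_report` sketch above.

```python
import matplotlib.pyplot as plt

# Illustrative per-bin points: (mean stated probability, observed frequency).
bins = [(0.35, 0.36), (0.45, 0.44), (0.55, 0.51), (0.65, 0.50), (0.75, 0.62)]
stated = [b[0] for b in bins]
observed = [b[1] for b in bins]

plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")  # 45-degree line
plt.scatter(stated, observed, label="model bins")
plt.xlabel("stated probability")
plt.ylabel("observed frequency")
plt.legend()
plt.show()
```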
- Reporting accuracy without reporting calibration. A model that is 70% accurate but systematically overconfident at the tails is still a bad bet; the headline number hides the failure mode.
- Calibrating in-sample. Calibration must be measured on held-out test data; in-sample calibration is meaningless because the model has seen the labels.
- Treating calibration as binary. Models are calibrated to varying degrees across different probability ranges. A model can be well calibrated at 0.4-0.6 and badly calibrated at the tails.
What is the difference between accuracy and calibration?
Accuracy measures how often the model's top-1 prediction is correct. Calibration measures whether the probabilities the model assigns match the actual frequencies of outcomes. A model that always says '60% chance of A' for an event that occurs 50% of the time achieves 50% top-1 accuracy however you grade it, but it is poorly calibrated: it claims 60% and delivers 50%. For betting, calibration is dominant because it controls every downstream EV calculation.
How is calibration improved if a model is found to be miscalibrated?
Post-hoc calibration methods like Platt scaling (logistic regression on the model output) or isotonic regression learn a monotonic transform from raw model probabilities to calibrated probabilities, fit on a held-out calibration set. This is standard practice in production ML systems for sports prediction. The transform is then applied to all live predictions.
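A hedged sketch of both transforms using scikit-learn, which provides `IsotonicRegression` and `LogisticRegression`. The synthetic "overconfident model" below is only for illustration; in a real system the calibration set would be held-out predictions and outcomes the model never trained on.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic held-out calibration set: raw probabilities pushed toward the
# extremes relative to the true ones, i.e. an overconfident model.
true_p = rng.uniform(0.05, 0.95, size=5000)
outcomes = (rng.uniform(size=true_p.size) < true_p).astype(int)
raw_probs = np.clip(true_p + 0.6 * (true_p - 0.5), 0.01, 0.99)

# Isotonic regression: learns a monotonic map from raw to calibrated probability.
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(raw_probs, outcomes)

# Platt scaling: logistic regression on the raw score (here, the log-odds).
log_odds = np.log(raw_probs / (1 - raw_probs)).reshape(-1, 1)
platt = LogisticRegression()
platt.fit(log_odds, outcomes)

# At prediction time, apply the learned transform to every live probability.
live_raw = np.array([0.25, 0.55, 0.80])
live_log_odds = np.log(live_raw / (1 - live_raw)).reshape(-1, 1)
print("isotonic:", iso.predict(live_raw))
print("platt:   ", platt.predict_proba(live_log_odds)[:, 1])
```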
Can a model be too confident or not confident enough?
Yes, both happen. Models trained with cross-entropy loss often become overconfident, pushing probabilities toward 0 or 1, especially under class imbalance. Averaging probabilities across an ensemble can pull them toward the middle and leave the combined model underconfident. Calibration assessment catches both failure modes.
Why is calibration central to HVP?
Because every win-rate or ROI claim implicitly assumes the model's probability outputs correspond to real frequencies. If they do not, headline numbers are an artifact of bin selection rather than a real edge. HVP rule 2 (Beta-Binomial CI lower bound) and rule 5 (empirical CLV haircut) are both corrections for the gap between model-stated and actually-realized probabilities.
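As an illustration of the rule-2 idea, assuming it amounts to reporting the lower bound of a Beta posterior credible interval on the observed win rate rather than the raw point estimate; the uniform prior and 95% level below are illustrative assumptions, not the actual HVP parameters.

```python
from scipy.stats import beta

def win_rate_lower_bound(wins: int, losses: int, conf: float = 0.95,
                         prior_a: float = 1.0, prior_b: float = 1.0) -> float:
    """Lower bound of the Beta posterior credible interval on the true win rate.

    With a Beta(prior_a, prior_b) prior and a binomial likelihood, the posterior
    is Beta(prior_a + wins, prior_b + losses); report its (1 - conf) quantile
    instead of the raw win rate wins / (wins + losses)."""
    return float(beta.ppf(1.0 - conf, prior_a + wins, prior_b + losses))

# 60 wins in 100 bets: the raw 60% headline shrinks to roughly 0.52
# with a uniform prior, a more honest claim about the realized win rate.
print(win_rate_lower_bound(60, 40))
```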