The Honest Validation Protocol.
Eight rules. Every published win-rate claim, ROI claim, and live-tier promotion has to pass all eight before it ships. The protocol is public, dated, and append-only. This page is the spec the pre-registration is signed against.
Sports analytics has a reproducibility problem.
Most published win-rate claims in the industry would not survive a serious audit. Single-season holdouts get marketed as walk-forward tests. Confidence intervals get inflated by Monte Carlo realizations of the same games. Tier thresholds get fit on the test set. Closing-line-value haircuts get applied as "industry standard" numbers rather than as empirically measured ones. The same model gets re-evaluated under different conditions every time the prior result starts to age.
The Honest Validation Protocol is the eight-rule spec VAR uses to ship claims that survive that audit. Every line of it is here because we caught ourselves doing it the wrong way at least once. The rules are not aspirations; they are the audit-defensible standard the public numbers have to clear.
What every published claim has to clear.
01 · Walk-forward across at least three independent test seasons
Train on seasons up to year N, test on year N+1 only. Slide the window forward and repeat. The reported figure is the aggregate across at least three independent test seasons, with per-season breakdowns surfaced wherever sample size supports it. A single-season holdout captures the variance of one season in one rule environment, and that variance gets conflated with skill. Three or more seasons forces the noise to wash out.
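A minimal walk-forward sketch in Python. The `fit` and `evaluate` callables and the season-keyed data layout are illustrative assumptions, not VAR's production harness:

```python
from statistics import mean

def walk_forward(games_by_season, fit, evaluate, min_test_seasons=3):
    """Train on seasons up to N, test on season N+1 only; slide and repeat."""
    seasons = sorted(games_by_season)
    results = []
    for i in range(1, len(seasons)):
        train = [g for s in seasons[:i] for g in games_by_season[s]]
        test = games_by_season[seasons[i]]   # the test season is never trained on
        results.append((seasons[i], evaluate(fit(train), test)))
    if len(results) < min_test_seasons:
        raise ValueError("need at least three independent test seasons")
    return mean(score for _, score in results), results  # aggregate + per-season
```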
02 · Beta-Binomial 95% credible interval, cite the lower bound
Every win-rate claim ships with a 95% Beta-Binomial credible interval. The lower bound of that interval is the figure VAR plans and sizes against. Point estimates without intervals are memorization claims, not predictive-skill claims. The lower bound discipline trades a higher headline number for a number you can underwrite. At -110 closing, the break-even bar on the lower bound is 52.4%.
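A sketch of the lower-bound discipline, assuming a uniform Beta(1, 1) prior over the win probability; the 150-120 record in the usage example is invented to show the rule biting:

```python
from scipy.stats import beta

def credible_lower_bound(wins, losses, level=0.95, prior=(1.0, 1.0)):
    """Posterior is Beta(prior_a + wins, prior_b + losses); cite its 2.5% quantile."""
    a, b = prior[0] + wins, prior[1] + losses
    return float(beta.ppf((1 - level) / 2, a, b))

# 150-120 is a 55.6% point estimate, but the plannable figure is the lower
# bound (~49.6%), checked against the 52.4% break-even bar at -110.
lb = credible_lower_bound(150, 120)
print(f"lower bound {lb:.3f} | clears -110 break-even: {lb > 0.524}")
```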
03 · Pre-registered ship gates before looking at test data
Thresholds that define a tier, the success criterion, and the explicit failure criteria are all locked before the test data is touched. If a tier's threshold gets chosen after seeing the performance grid, the result is in-sample fitting and the implied skill is whatever the fitting could squeeze out. VAR's 2026-27 NFL forward test lives at /methodology/2026-27-predictions as a public, signed, append-only contract.
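One hedged way to make the lock verifiable: hash a frozen gate spec and publish the digest before test data is touched. The field names, thresholds, and dates below are hypothetical, not the contents of the actual pre-registration:

```python
import hashlib, json

gates = {                                   # hypothetical gate spec, frozen pre-test
    "tier": "PRIME spread",
    "ship_gate": {"credible_lower_bound_min": 0.545},
    "failure_criteria": ["lower bound < 0.50 at any graded checkpoint"],
    "signed": "2026-09-01",                 # before any test data is touched
}
digest = hashlib.sha256(json.dumps(gates, sort_keys=True).encode()).hexdigest()
print(digest)  # publish the digest; any post-hoc threshold edit breaks it
```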
04 · Independent samples only
The reported sample size is the number of independent betting decisions, not the number of Monte Carlo realizations of those decisions. Running the same model five times with different seeds does not produce five times the information about model skill; it produces five views of the same skill. Inflated sample counts produce confidence intervals that are artificially tight by roughly the square root of the inflation factor, and the gap between the inflated lower bound and the true lower bound is where break-even claims get marketed as edge claims.
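A worked sketch of that square-root effect, using a normal approximation to the binomial interval; the 500-bet sample and five-seed inflation are invented numbers:

```python
import math

def ci_halfwidth(p_hat, n, z=1.96):
    """Normal-approximation half-width of a 95% interval on a win rate."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

n, p_hat, seeds = 500, 0.53, 5               # 500 real bets, 5 seeds of the same bets
honest = ci_halfwidth(p_hat, n)              # ~0.044 -> lower bound ~0.486
inflated = ci_halfwidth(p_hat, n * seeds)    # ~0.020 -> lower bound ~0.510
# The inflated interval turns a break-even record into an apparent edge.
print(p_hat - honest, p_hat - inflated)
```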
05 · Empirical closing-line-value haircut
The decay from opening price to closing price gets measured empirically, per market type, from actual graded bets. Not a heuristic "conservative" number applied uniformly across all markets. Spread, totals, team totals, and player props each have meaningfully different opening-to-closing behavior. Live ROI claims get the per-market haircut applied before they ship.
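A minimal per-market measurement sketch, assuming each graded bet records the implied probability at bet time and at close; the column and function names are assumptions, not VAR's schema:

```python
import pandas as pd

def per_market_haircut(graded: pd.DataFrame) -> pd.Series:
    """Mean implied-probability decay from bet price to close, per market type."""
    decay = graded["close_implied_prob"] - graded["bet_implied_prob"]
    return decay.groupby(graded["market_type"]).mean()

# The resulting per-market series (spread, totals, team totals, props) is the
# haircut applied to that market's live ROI before any number ships.
```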
06 · Block-bootstrap bankroll simulation
Bankroll-growth and drawdown estimates come from block-bootstrap resamples of whole weekly slates, not from independent draws of individual bets. Bets within a slate are correlated (same week, shared market regime, shared model state), and block-bootstrapping at the slate level preserves the correlation structure that ruin-risk math actually depends on.
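A block-bootstrap sketch under the assumption that each weekly slate is a list of per-bet returns and settles as one block; stake sizing and bankroll math are simplified for illustration:

```python
import random

def block_bootstrap_drawdowns(slates, n_sims=10_000, seed=0):
    """slates: one list of per-bet returns per week; resample whole weeks."""
    rng = random.Random(seed)
    worst_drawdowns = []
    for _ in range(n_sims):
        bankroll, peak, worst = 1.0, 1.0, 0.0
        for _ in slates:
            slate = rng.choice(slates)       # draw a whole weekly slate
            bankroll *= 1.0 + sum(slate)     # the slate settles as one block
            peak = max(peak, bankroll)
            worst = max(worst, 1.0 - bankroll / peak)
        worst_drawdowns.append(worst)
    return worst_drawdowns                   # feed into ruin-risk summaries
```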
07 · Memory hygiene on superseded claims
When an audit retracts a prior win-rate claim, the prior claim gets dated and marked superseded; it never silent-updates into a new number. The change is visible in the published changelog: readers can see what was retracted and why. The presence of retracted claims, not the absence of them, is part of what makes the live numbers credible.
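As a hypothetical illustration of the supersede flow (every entry and number below is invented), the retracted claim stays in place, dated and flagged, and the correction lands as a new appended entry:

```python
changelog = [
    {"date": "2026-01-12", "status": "superseded",
     "claim": "PRIME spread lower bound 52.9%"},
    {"date": "2026-03-08", "supersedes": "2026-01-12",
     "claim": "PRIME spread lower bound 51.4%",
     "reason": "audit found duplicated realizations inflating the sample"},
]
# The 2026-01-12 entry is never edited or removed; readers can diff the two.
```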
08 · Production-code-path verification
Every audit runs against the same code path that ships predictions to the live tracker. Validating against a clean-room re-implementation of the model produces numbers that do not transfer to deployment. The pre-registration covers the production tier; the audit covers the production tier; the live tracker traces back to the same tier.
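One hedged way to enforce the shared code path: record the deployed artifact's digest at ship time and make the audit harness refuse to run against anything else. The path and names below are assumptions about layout, not VAR's actual deployment:

```python
import hashlib, pathlib

def artifact_digest(path: str) -> str:
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

DEPLOYED_DIGEST = "<digest recorded when the live tracker's model shipped>"
assert artifact_digest("models/prime_spread.bin") == DEPLOYED_DIGEST, \
    "audit is not running against the production code path"
```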
The protocol is uniform; the data available isn't.
Public-data density varies meaningfully by league. The protocol applies the same way at every level, but the resulting quantitative claims are tighter at higher-rigor leagues and bounded at lower-rigor ones. Methodology-transparent disclosure becomes more important at lower-rigor levels, not less: honest framing of the rigor ceiling is what makes a moderate-rigor methodology defensible.
| League | Rigor tier | Why |
|---|---|---|
| NBA | Highest | Dense play-by-play, multiple seasons of structured data, robust public aggregate checks, mature analytics community. |
| NFL | High | Dense PBP, multiple seasons, strong public aggregate checks; lower sample density than NBA due to fewer games. |
| WNBA | Moderate | Structured PBP available; aggregate-check sources thinner than NBA; sample density limited by season length. |
| College Basketball | Moderate | Dense PBP and rich rosters; rapid roster turnover and large team count reduce model stability vs NBA. |
| College Football | Bounded | Public charting and tracking are sparser; talent-and-scheme variance is high; the rigor ceiling is the data ceiling. |
| Women's College Basketball | Bounded | First-of-its-kind public-data analytics surface in many cases; methodology-transparent disclosure of the bound is the point. |
Where the protocol shows up in public.
The protocol is the spec. Three surfaces apply it:
- /methodology/2026-27-predictions — the 2026-27 NFL forward test pre-registration. Signed before Week 1 kickoff, append-only, locks two pre-registered tiers (PRIME spread and PRIME_TOT) against pre-committed success criteria and three graded failure checkpoints.
- /performance — live tracking surface. Every PRIME-tier pick lands before kickoff. Every result lands within 24 hours of the game ending. Never back-dated. The running Beta-Binomial lower bound is on the page for each tier.
- /insights — postmortems and methodology notes. Postmortem SLA is 24 hours from a triggering event (checkpoint flag, conspicuous miss, audit retraction). The template is pre-built so response time does not become the constraint.
How to cite this protocol.
Suggested form:
Victory Analytics and Research (2026). Honest Validation Protocol, v2026-04-30. URL: https://victory-ar.com/methodology/protocol.
Revisions land as new dated entries; prior versions remain readable via the changelog. The signing date at the top of this page reflects the most recent active version.