AI Can't Bet on Sports? Wrong AI, Right Problem.
General Reasoning just released KellyBench — a study that tasked frontier LLMs with betting a full Premier League season. The results made headlines everywhere from Ars Technica to Benzinga: every model lost money, Grok went bankrupt, and Claude — the best performer — still finished down 11%.
The headline is fun. But the lesson isn't "AI can't bet on sports." It's that general-purpose language models aren't built for this.
What KellyBench Actually Tested
The study placed eight frontier AI systems — GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, Grok 4.20, and others — into a simulated recreation of the 2023–24 Premier League season. Each model received detailed historical data, advanced statistics, lineups, past results, and public odds. Internet access was blocked. Their job: build models, identify edge, size bets, manage risk, and adapt as the season unfolded.
Each model got three attempts. Results ranged from Claude's average loss of 11% to Grok burning through nearly 90% of its bankroll before going bankrupt on its first attempt. No model scored more than a third of the available points on the 44-point sophistication rubric that General Reasoning developed with quantitative betting experts.
The authors' conclusion was pointed: AI handles static, rule-bound tasks well. Adapting to the continuous, chaotic flow of real-world data is a different problem entirely.
The Real Takeaway
Asking an LLM to build a profitable sports model from scratch inside a single prompt chain is like asking a brilliant generalist to walk onto a quant trading desk with no infrastructure, no backtesting framework, no feature pipeline, and no bankroll management system — and expecting them to print money by Friday.
Of course they failed.
But buried in the results is something more interesting. The models that survived longest shared traits that any quantitative bettor would recognize immediately:
- They retrained mid-season. Claude and GPT-5.4 adjusted their strategies in response to new match data rather than running a static model all year.
- They used systematic staking. Models that deployed structured position sizing — as opposed to ad-hoc bet amounts — avoided ruin. The Kelly Criterion exists for a reason.
- They preserved capital. The survivors sat out periods where their strategies identified no edge, rather than forcing action on every matchday.
These aren't LLM problems. These are sports modeling problems. And they have known solutions — solutions that require purpose-built ML pipelines, domain-specific feature engineering, rigorous backtesting, and disciplined bankroll management.
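The staking discipline described above can be made concrete in a few lines. The sketch below is illustrative, not code from the study: the win probability, the decimal odds, and the quarter-Kelly scale factor are all assumed values.

```python
def kelly_fraction(p_win: float, decimal_odds: float, scale: float = 0.25) -> float:
    """Fractional Kelly stake as a share of bankroll.

    p_win:        model-estimated probability that the bet wins (assumed input)
    decimal_odds: bookmaker decimal odds (total payout per unit staked)
    scale:        Kelly fraction (0.25 = quarter-Kelly) to damp estimation error
    """
    b = decimal_odds - 1.0              # net odds: profit per unit staked
    edge = p_win * b - (1.0 - p_win)    # expected profit per unit staked
    if edge <= 0:
        return 0.0                      # no edge: preserve capital, sit it out
    return scale * edge / b             # full Kelly is edge / b; stake a fraction

# Hypothetical example: model says 50% win probability at decimal odds of 2.20
stake = kelly_fraction(0.50, 2.20)
print(round(stake, 4))  # quarter of full Kelly (0.5 * 1.2 - 0.5) / 1.2
```

The `edge <= 0` branch is the capital-preservation behavior the surviving models stumbled into: when the estimated probability offers no advantage over the odds, the correct stake is zero, not a small flutter.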
The Architecture Gap
The deeper lesson from KellyBench isn't that AI fails at sports prediction. It's that the right AI architecture matters enormously.
A well-engineered gradient-boosted ensemble — XGBoost, LightGBM — trained on sport-specific features and validated out-of-sample across multiple seasons will outperform the most powerful LLM on this task every time. Not because LLMs aren't impressive, but because they're solving the wrong problem. Language models optimize for next-token prediction. Sports markets require probabilistic forecasting under uncertainty, with systematic risk management layered on top.
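"Validated out-of-sample across multiple seasons" has a specific shape: walk-forward splits, where each season is scored by a model trained only on strictly earlier ones. A minimal sketch of that split logic, with hypothetical season labels (the actual training of an XGBoost or LightGBM model on each split is omitted):

```python
from typing import Iterator, List, Sequence, Tuple

def season_walk_forward(
    seasons: Sequence[str], min_train: int = 2
) -> Iterator[Tuple[List[str], str]]:
    """Yield (train_seasons, test_season) splits in chronological order.

    Every test season is evaluated by a model fit only on earlier seasons,
    so the backtest never "learns" results it could not have known at bet time.
    """
    for i in range(min_train, len(seasons)):
        yield list(seasons[:i]), seasons[i]

# Hypothetical season labels for illustration
seasons = ["2019-20", "2020-21", "2021-22", "2022-23", "2023-24"]
for train, test in season_walk_forward(seasons):
    print(f"train on {train} -> validate on {test}")
```

The point of the chronological ordering is leakage prevention: a random train/test shuffle would let future matches inform past predictions and quietly inflate backtest returns.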
The KellyBench models that accidentally stumbled into survival did so by rediscovering practices that the quantitative sports world has refined for decades: fractional Kelly sizing, capital preservation in low-edge environments, and continuous model recalibration.
Ross Taylor, General Reasoning's CEO, put it well: "There is so much hype about AI automation, but there's not a lot of measurement of putting AI into a long-time horizon setting." He's right. And this is exactly the kind of long-horizon, high-complexity problem where purpose-built systems separate signal from noise.
What Purpose-Built Looks Like
At Victory Analytics and Research, this is the problem we've been engineering against from day one. Our simulation platforms are built sport-by-sport — NFL, NBA, college basketball, UFC — with dedicated feature pipelines, GPU-accelerated model training, and Kelly-based position sizing baked into the architecture.
We don't prompt a chatbot to place bets. We build quantitative systems designed to find and exploit edge, validated across years of out-of-sample data before a single dollar is risked.
The hype cycle wants AI to be magic. The reality is that beating efficient markets requires engineering, not eloquence.
KellyBench is a great benchmark. It just has the wrong headline.
Eric Wong is the founder of Victory Analytics and Research, building ML-powered sports simulation platforms for professional franchises, media companies, and prediction market operators.