All four models have submitted a full 48-team forecast against the same frozen context pack. The next update lands after the group stage. Last updated: 8 June 2026.
The 2026 World Cup kicks off in three days. Forty-eight teams across the United States, Canada, and Mexico, an expanded group stage, and a knockout bracket that runs through 19 July. Before a single ball is kicked, I want to know what four frontier language models think will happen, and I want a record of those predictions that I can hold up against reality once the tournament is over.
This post is the record. The first version goes up now, with the pre-tournament forecasts. After that, I will append a dated section after each round, score the LLMs against what actually happens, and write a post-mortem at the end.
What I gave them
Four models, each scored independently:
- Claude Opus 4.7 (Anthropic)
- Claude Sonnet 4.6 (Anthropic)
- GPT-4o (OpenAI)
- Gemini 2.5 Pro (Google)
I started with an evidence-only context pack: 48 team dossiers, the official Group A–L draw, FC26 squad ratings, manager bios, recent qualifying form, injury status as of 6 June, betting-market odds, head-to-head history, and a one-page tournament overview. The dossiers were assembled by a 134-agent research workflow that fanned out across the squads in parallel, then a single reviewer consolidated the output into one Markdown context pack. No predictions were written into the pack. The frozen pack has SHA-256 hash d1de89a37d73e90f and the canonical prompt has hash ace99a1539587d1b. Both are pinned in the provenance block of every JSON output so the run is reproducible.
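A minimal sketch of how such a pin could be computed, assuming the pack and the prompt are each hashed as raw bytes and that the 16-character values quoted above are truncated SHA-256 hex digests (an assumption; the post does not say how they were shortened). The function names and the provenance keys are hypothetical.

```python
import hashlib


def short_sha256(data: bytes, length: int = 16) -> str:
    """Return the first `length` hex characters of the SHA-256 digest of `data`."""
    return hashlib.sha256(data).hexdigest()[:length]


def provenance_block(context_pack: bytes, prompt: bytes) -> dict[str, str]:
    # Key names are illustrative; the post does not show the real schema.
    return {
        "context_pack_sha256": short_sha256(context_pack),
        "prompt_sha256": short_sha256(prompt),
    }
```

Re-hashing the frozen files before each run and comparing against the pinned values is enough to catch an accidental edit to the pack.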
The rule for the models was strict: no web, no tools, no live data, no chain-of-thought scaffolding beyond what the model defaults to. Each model received the same context pack plus the same instruction to output a structured JSON forecast: a champion probability distribution across all 48 teams, a per-stage probability per team (group winner, round of 16, quarter, semi, final, champion), one modal group winner per group with a probability, and a self-audit of the three least-confident calls.
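A structured forecast like this can be sanity-checked before it is scored. The sketch below assumes hypothetical field names (`champion`, `per_stage`, and the stage keys); the real schema is not shown in this post. It checks that the champion distribution covers the expected teams and sums to 1, and that each team's knockout-stage probabilities never increase from one stage to the next.

```python
KNOCKOUT_STAGES = ("round_of_16", "quarter", "semi", "final", "champion")


def validate_forecast(forecast: dict, teams: list[str], tol: float = 1e-6) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    champion = forecast.get("champion", {})
    if set(champion) != set(teams):
        problems.append("champion distribution does not cover exactly the expected teams")
    if any(not 0.0 <= p <= 1.0 for p in champion.values()):
        problems.append("champion probability outside [0, 1]")
    if abs(sum(champion.values()) - 1.0) > tol:
        problems.append("champion probabilities do not sum to 1")
    # Reaching a later knockout stage cannot be more likely than reaching an earlier one.
    for team, stages in forecast.get("per_stage", {}).items():
        chain = [stages[s] for s in KNOCKOUT_STAGES]
        if any(later > earlier + tol for earlier, later in zip(chain, chain[1:])):
            problems.append(f"{team}: stage probabilities are not non-increasing")
    return problems
```

The group-winner probability is left out of the monotonicity check on purpose, since a team can reach the knockouts without winning its group.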
To probe how much the dossier was doing the work, I split the experiment into four arms:
- Arm A (data-fed). Full context pack, default temperature, one sample per model.
- Arm B (deterministic). Opus 4.7 only, temperature 0, full context pack. A small reproducibility check.
- Arm C (cold control). Same prompt and same JSON schema as Arm A, but the context pack is stripped down to just team names, the group draw, the tournament dates, and the 48-team format. If a model still produces the same forecast, the dossier did nothing.
- Arm D (order permutations). Full context pack, but I permute the order of the dossier sections (favourites-first, reverse, shuffled, confederation-grouped, underdogs-first) to see whether the model anchors on the early sections. Only the canonical permutation has run so far for all four models; the other permutations land in the next round.
Arm A is the main forecast. Arm C is the contamination check. Arm D is the stability check. Arm B is there so I can come back later and prove the temperature=default forecast is not a one-off draw from a wild distribution. Everything in this post is from a single sample per model per arm; the multi-sample run for Arms C and D is queued but not done.
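The Arm D orderings could be generated along these lines. This is a sketch under assumptions: the canonical order is taken to be the order of the input list, "favourites" are ranked by a single hypothetical `rating` field, and the shuffle is seeded so the permutation is reproducible. None of these choices are confirmed by the actual pipeline.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    team: str
    confederation: str
    rating: float  # hypothetical sort key, e.g. a squad average


def dossier_orderings(sections: list[Section], seed: int = 0) -> dict[str, list[Section]]:
    """Build the named orderings of the dossier sections."""
    shuffled = list(sections)
    random.Random(seed).shuffle(shuffled)  # seeded, so the order can be re-created
    return {
        "canonical": list(sections),
        "reverse": list(reversed(sections)),
        "favourites_first": sorted(sections, key=lambda s: s.rating, reverse=True),
        "underdogs_first": sorted(sections, key=lambda s: s.rating),
        # sorted() is stable, so the canonical order is kept within each confederation.
        "confederation_grouped": sorted(sections, key=lambda s: s.confederation),
        "shuffled": shuffled,
    }
```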
Where they agreed
Across all four models, Spain is the favourite. Its mean champion probability across the four Arm-A forecasts is 15.2%, and no model puts it above 16.0%. That matters because a well-calibrated forecast for an open 48-team tournament should not put any single team above 18–20%, and the four models stay inside that band by a comfortable margin.
Here is the top 12 by mean champion probability, with each model’s number alongside:
| Rank | Team | Opus 4.7 | Sonnet 4.6 | GPT-4o | Gemini 2.5 Pro | Mean |
|---|---|---|---|---|---|---|
| 1 | Spain | 15.5% | 16.0% | 13.8% | 15.5% | 15.2% |
| 2 | France | 14.5% | 14.9% | 12.1% | 14.5% | 14.0% |
| 3 | England | 11.5% | 11.9% | 11.6% | 13.4% | 12.1% |
| 4 | Argentina | 10.5% | 10.8% | 11.2% | 10.3% | 10.7% |
| 5 | Brazil | 9.5% | 8.8% | 9.9% | 7.2% | 8.9% |
| 6 | Portugal | 7.0% | 6.7% | 7.3% | 8.2% | 7.3% |
| 7 | Germany | 6.0% | 9.3% | 6.0% | 6.2% | 6.9% |
| 8 | Netherlands | 5.0% | 4.6% | 3.9% | 5.1% | 4.7% |
| 9 | Belgium | 3.0% | 2.6% | 3.0% | 2.1% | 2.7% |
| 10 | Colombia | 1.8% | 1.9% | 1.7% | 2.6% | 2.0% |
| 11 | Uruguay | 1.5% | 1.5% | 1.3% | 1.5% | 1.5% |
| 12 | Croatia | 1.5% | 0.9% | 1.7% | 1.5% | 1.4% |
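The top six by mean are Spain, France, England, Argentina, Brazil, and Portugal, and Opus, GPT-4o, and Gemini each have exactly those six on top. Sonnet is the exception: it has Germany (9.3%) ahead of Brazil (8.8%) and Portugal (6.7%). Those six teams sum to about two-thirds of the championship mass on every model. That is the consensus tier.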
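The means and the two-thirds figure can be re-derived from the table. A short sketch, with the numbers copied from the six highest-mean rows above and variable names chosen for illustration:

```python
MODELS = ("Opus 4.7", "Sonnet 4.6", "GPT-4o", "Gemini 2.5 Pro")

# Champion probabilities (%) for the six highest-mean teams, in MODELS order.
TOP_SIX = {
    "Spain": (15.5, 16.0, 13.8, 15.5),
    "France": (14.5, 14.9, 12.1, 14.5),
    "England": (11.5, 11.9, 11.6, 13.4),
    "Argentina": (10.5, 10.8, 11.2, 10.3),
    "Brazil": (9.5, 8.8, 9.9, 7.2),
    "Portugal": (7.0, 6.7, 7.3, 8.2),
}

# Mean across the four models, per team.
means = {team: sum(row) / len(row) for team, row in TOP_SIX.items()}

# Share of championship mass each model gives to these six teams.
mass = {
    model: sum(row[i] for row in TOP_SIX.values())
    for i, model in enumerate(MODELS)
}

for model, total in mass.items():
    print(f"{model}: {total:.1f}% on the six teams")
```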
Group winners agreed even more sharply. The cross-experiment summary lists each model’s modal pick per group:
| Group | Opus 4.7 | Sonnet 4.6 | GPT-4o | Gemini 2.5 Pro |
|---|---|---|---|---|
| A | Mexico (52%) | Mexico (52%) | Mexico (58%) | Mexico (55%) |
| B | Switzerland (54%) | Switzerland (56%) | Switzerland (57%) | Switzerland (58%) |
| C | Brazil (74%) | Brazil (78%) | Brazil (82%) | Brazil (82%) |
| D | USA (42%) | Türkiye (38%) | USA (46%) | Türkiye (38%) |
| E | Germany (62%) | Germany (68%) | Germany (67%) | Germany (70%) |
| F | Netherlands (52%) | Netherlands (62%) | Netherlands (60%) | Netherlands (65%) |
| G | Belgium (58%) | Belgium (58%) | Belgium (65%) | Belgium (72%) |
| H | Spain (74%) | Spain (80%) | Spain (81%) | Spain (80%) |
| I | France (66%) | France (78%) | France (78%) | France (78%) |
| J | Argentina (72%) | Argentina (78%) | Argentina (80%) | Argentina (75%) |
| K | Portugal (62%) | Portugal (65%) | Portugal (70%) | Portugal (68%) |
| L | England (70%) | England (72%) | England (74%) | England (76%) |
Eleven of twelve groups have unanimous agreement on the modal winner. Group D is the only fork, and I will come back to it.
The match-level picture is the same. On the full-schema Arm-A run, each model produced 72 group-stage match probabilities. All four models pick the same modal outcome in 69 of 72 matches; only one match is a genuine 2-2 split. The standard deviation of P(team A wins) across the four models exceeds 0.08 on just one match (Morocco vs Haiti in Group C). For a four-model panel, that is a lot of agreement.
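For one match, the agreement check can be sketched as follows. It assumes each model supplies a three-way probability tuple and uses the population standard deviation; the post does not say whether the population or sample form was used, so treat that choice as an assumption.

```python
from collections import Counter
from statistics import pstdev

Probs = tuple[float, float, float]  # (P(team A wins), P(draw), P(team B wins))


def modal_outcome(probs: Probs) -> int:
    """Index of the most likely outcome: 0 = team A, 1 = draw, 2 = team B."""
    return max(range(3), key=lambda i: probs[i])


def match_agreement(per_model: dict[str, Probs]) -> tuple[list[int], float]:
    """Return the vote split over modal outcomes and the spread of P(team A wins)."""
    votes = Counter(modal_outcome(p) for p in per_model.values())
    split = sorted(votes.values(), reverse=True)  # [4] is unanimous, [2, 2] a fork
    spread = pstdev([p[0] for p in per_model.values()])
    return split, spread
```

Counting the matches whose split is `[4]` gives the unanimity figure, and filtering on the spread gives the list of contentious matches.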
The implied draw rate is in the right ballpark too. Historical World Cup group-stage draws sit around 22–27%. Each model lands inside that band: Sonnet 22.4%, Opus 24.3%, Gemini 23.5%, GPT-4o 24.9%. Nobody is systematically draw-averse, nobody is over-weighting draws.
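The implied draw rate is just the mean draw probability over a model's match forecasts. A small sketch, assuming the same `(P(A wins), P(draw), P(B wins))` tuple layout, which is a convention chosen here rather than the post's actual schema:

```python
def implied_draw_rate(matches: list[tuple[float, float, float]]) -> float:
    """Mean of P(draw) across a model's match forecasts."""
    return sum(p_draw for _, p_draw, _ in matches) / len(matches)


def in_historical_band(rate: float, low: float = 0.22, high: float = 0.27) -> bool:
    """True if the rate falls inside the 22-27% band quoted above."""
    return low <= rate <= high
```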
Where they disagreed
Three real splits, in order of how interesting they are.
Group D: USA vs Türkiye. Opus and GPT-4o give the group to the host USA (42% and 46%). Sonnet and Gemini give it to Türkiye (both 38%). The pack has Türkiye carrying the higher FC26 squad average (77.7 vs the USA’s 76.4), with Çalhanoğlu, Güler, and Yıldız as the core, but the USA has Pochettino and the home crowd. Two of the four models weighted squad quality above host advantage; the other two went the other way. Sonnet’s and Gemini’s self-audit blocks both flag this group as one of their three least-confident calls; Opus flags it only in the deterministic Arm B run, not in the default-temperature run.
Brazil vs England as the third elite. GPT-4o ranks England third (11.6%) above Brazil (9.9%). Gemini puts England third at 13.4% and Brazil sixth at 7.2%, the largest single gap in the panel. Opus and Sonnet keep Brazil closer than Gemini does (2.0 and 3.1 points behind England, respectively) but still behind. The Brazil-vs-England wedge is mostly Gemini’s doing: it has the strongest pro-England prior and the strongest anti-Brazil update once the dossier arrives. More on that in the next section.
The top contentious matches from the full-schema arm. The ten matches where the four models most disagree on P(team A wins) are skewed toward African and CONCACAF games where the favourites have one strong opponent and one weaker one. Morocco vs Haiti (std 0.082) is the biggest disagreement; Senegal vs Iraq (0.076) and Norway vs Iraq (0.074) follow. Mexico vs Czech Republic (0.071) is the only Group A fixture in the top ten. None of these disagreements are about the modal winner; they are about how much to credit the favourite. GPT-4o tends to be the most conservative on these (60% favourite probabilities where the others are at 75–80%); Gemini tends to be the most confident.
Did they ignore the dossier?
This is the question I most wanted answered. If a model already “knows” the 2026 World Cup from pre-training, the dossier is decoration: stripping it out should not move the forecast. If a model actually conditions on the pack, removing it should produce a measurably different distribution.
For Arm C I gave each model the same JSON schema and the same six prediction tasks, but the context shrank to four bullet points: the 48 team names, the group draw, the tournament dates, and the 48-team format. No squads, no form, no managers, no odds.
The L1 distance between each model’s cold and data-fed champion distribution (summed over all 48 teams, max possible 2.0) tells the story:
GPT-4o moves the most when the dossier appears. Its cold top five is Argentina, Brazil, Germany, France, Spain; its data-fed top five is Spain, France, England, Argentina, Brazil. Spain jumps +6.1 percentage points, England jumps +4.4, France jumps +3.5, Portugal jumps +1.8, all on the back of the pack. Germany loses 3.4 points, Netherlands loses 3.0, Belgium loses 1.3. The reshuffle is exactly what an “evidence-conditioned” forecaster should look like: it updates toward the teams whose dossiers are most positive.
Opus barely moves at all. The L1 of 0.128 is less than half of any other model’s, and the biggest single movement is Spain gaining +2.8 points. The rest of Opus’s top 12 is within a percentage point of where it was cold. That implies the parametric memory already encodes a strong prior on these squads. The dossier reinforces what Opus thought, but it does not teach Opus anything new. That is the contamination warning that matters: if Opus already “knows” the 2026 draw, a data-fed forecast is closer to a training-data lookup than a conditional inference.
Sonnet and Gemini fall in between. Both gain about 7–8 points for Spain when the pack arrives, and both lose ground on whichever team they had over-weighted cold (Argentina for Sonnet, Brazil for Gemini). Gemini’s cold forecast has Brazil at 13.5% and the data-fed version has Brazil at 7.2%, a 6.3-point drop. That is the largest single drop in the panel and the main reason Gemini’s data-fed top six looks different from everyone else’s.
A coarse but useful summary: the dossier moved GPT-4o, Sonnet, and Gemini measurably; it barely moved Opus. None of these are calibrated against ground truth yet. They tell you which model treats the context as evidence and which model treats it as confirmation.
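The L1 distance used in this section is the sum of absolute differences between two champion distributions. A minimal sketch, assuming each distribution is a dict of team name to probability (a representation chosen here for illustration):

```python
def l1_distance(p: dict[str, float], q: dict[str, float]) -> float:
    """Sum of |p(team) - q(team)| over every team in either distribution."""
    teams = set(p) | set(q)
    return sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in teams)
```

Two identical distributions give 0.0, and two distributions with no overlapping mass give the maximum of 2.0, which is why the cold-vs-data-fed figures are quoted against a ceiling of 2.0.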
How stable were they to context order?
Arm D is meant to permute the dossier sections (favourites-first, reverse, shuffled, confederation-grouped, underdogs-first) and check whether the forecast moves. Only the canonical permutation has run for all four models so far; the others are queued for the next sweep. So the order-sensitivity numbers here are an n=1 within-model comparison: arm-A canonical vs arm-D canonical, same prompt, same context order, two independent samples at default temperature. It is the closest thing I have to a sampling-noise floor before the actual permutation sweep.
Sonnet is the most stable (L1 0.062 between two canonical-order runs). Opus is close behind (0.080). GPT-4o moves more between runs even at the same context order (0.147), almost half its full cold-vs-data-fed delta. That suggests a chunk of GPT-4o’s apparent dossier-reliance in the previous section is actually sampling noise rather than evidence-conditioning, and I will not know how much until the multi-sample arms complete. The honest version is: GPT-4o is the most volatile of the four, both across sampling and across context, and Opus and Sonnet are the most consistent. I cannot separate “anchored on the dossier order” from “high sample variance” until the permutation sweep finishes.
Gemini did not produce a valid JSON forecast in Arm D (the raw text was returned but the champion distribution was not parseable), so it is missing from this chart. The next sweep should fix that.
Quality of reasoning
I asked each model for a self-audit: the three least-confident calls in its forecast, with a short reason for each. Three of the four models produced calls that named genuine close-decisions in the pack. One did not.
GPT-4o’s self-audit was three bets, not three uncertainties. Verbatim from its JSON:
Spain to win Group H. Reason: Spain has one of the strongest squads and is joint favorite.
Brazil to reach semi-finals. Reason: Brazil’s historical performance and current squad depth support this.
Argentina to win Group J. Reason: Argentina’s squad depth and Messi’s leadership make them favorites.
These are statements with 80%+ probabilities in GPT-4o’s own forecast. Calling them low-confidence reads as a misuse of the prompt: the model has used the audit field to restate its high-confidence picks. Whether that is a literal misunderstanding of “least confident” or a hedging strategy, I cannot tell from one sample.
Opus, Sonnet, and Gemini all flagged real close calls. Sonnet and Gemini both flagged Group D (Türkiye vs USA, the only group-winner fork) as a low-confidence call. Sonnet flagged Norway at 1.2% as a “realistic ceiling without overrating them”. Gemini flagged Ghana as a group-exit risk because of two specific injuries the pack mentions (Mohammed Kudus, Alexander Djiku). Opus flagged the top-6 sum landing at 67% and the long-tail teams retaining 3% combined: meta-calibration calls rather than per-team uncertainty, which is a different kind of self-audit but still recognises that the calibration band is the load-bearing constraint.
The honest reading is that three of four models took the audit prompt seriously and one did not. That is a small sample, and the self-audit might land differently if I rerun GPT-4o with the audit reframed as “name the three calls most likely to be wrong”. The next sample will tell me whether this is a one-shot phrasing miss or a persistent pattern.
How I will score them
Two separate things, and they should not be conflated.
The first is agreement, which I have already scored: 11 of 12 group winners are unanimous, 69 of 72 group-stage match modal outcomes are unanimous, all four models put Spain at the top with a champion probability between 13.8% and 16.0%, and the same six teams fill the top six in three of the four forecasts (Sonnet puts Germany ahead of Brazil and Portugal). That tells me the training data has consensus baked in. It does not tell me the consensus is right.
The second is accuracy, which I will score after each round of the tournament. The winner, the runner-up, the Golden Boot, and the third-place finish are unambiguous and resolve at the final. The group winners resolve after the group stage. The full 72-match per-model probability set lets me score Brier on a much larger sample than the headline picks, which is what I actually care about for any future productisation. Brier across 72 group-stage matches per model will give me four numbers I can compare directly.
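For a three-outcome match, the Brier score is the sum of squared differences between the forecast and the one-hot result. The sketch below uses the multi-class form, which ranges from 0 (perfect) to 2 (all mass on a wrong outcome); whether the final scorecard uses this form or a rescaled variant is an assumption here.

```python
Probs = tuple[float, float, float]  # (P(team A wins), P(draw), P(team B wins))


def brier(probs: Probs, outcome: int) -> float:
    """Multi-class Brier score for one match; `outcome` is 0, 1, or 2."""
    return sum((p - (1.0 if i == outcome else 0.0)) ** 2 for i, p in enumerate(probs))


def mean_brier(forecasts: list[Probs], outcomes: list[int]) -> float:
    """Average Brier score across matches; lower is better."""
    if len(forecasts) != len(outcomes):
        raise ValueError("need exactly one outcome per forecast")
    return sum(brier(f, o) for f, o in zip(forecasts, outcomes)) / len(forecasts)
```

A forecaster who always answers one-third for each outcome scores about 0.667 per match, which is a useful floor to compare the four models against.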
There is a third thing I would like to track, which is calibration: when a model says “Spain 15.5% to win the tournament”, is that 15.5% reliable across an ensemble of similar tournaments? With a single-tournament sample I cannot answer that, but I can score the 72 group-stage match probabilities against the actual results and compute a reliability diagram per model. That will be the most informative output of this whole experiment.
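The data behind a reliability diagram is a binned comparison of predicted probability against observed frequency. A sketch, assuming equal-width bins and one `(probability, happened)` pair per forecast outcome; the bin count of 10 is an arbitrary choice, not a setting from this experiment.

```python
def reliability_table(
    pairs: list[tuple[float, bool]], n_bins: int = 10
) -> list[tuple[float, float, int]]:
    """Return (mean predicted probability, observed frequency, count) per non-empty bin."""
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for p, happened in pairs:
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, happened))
    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            freq = sum(1 for _, happened in members if happened) / len(members)
            rows.append((mean_p, freq, len(members)))
    return rows
```

With 72 matches and three outcomes each, a model contributes 216 pairs, so the per-bin counts will be small and the diagram should be read loosely.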
What this is not
This is the small version. Single tournament, single context pack per arm, one sample per model per arm. The data we have so far suggests Spain is the consensus favourite, that GPT-4o updates most on evidence and Opus updates least, that all four models implicitly know the same top six, and that Group D and the Brazil-vs-England ordering are the two real disagreements. None of that is “proven”. The multi-sample run for Arms C and D will tell me how much of the cold-vs-data-fed shift is sampling noise versus genuine evidence-conditioning, and I will update this post when those numbers land.
I am also running a parallel experiment with the same models but tools enabled (web search, betting-market data feed, live injury feed). That version is queued for a separate post. The version-with-tools forecast will probably be sharper; the question is whether sharper is more accurate, or just more confidently wrong.
For this post: the predictions are in, the JSON outputs are pinned to the prompt and context hashes in their provenance blocks, and the next update is the group-stage scorecard.
Live updates
The format I am planning to use:
- Group stage: score the 72 modal match outcomes, score Brier across the 72-match probability sets, note where group winners surprised the consensus (including the Group D fork).
- Round of 16: flag any model whose entire knockout bracket is already dead. Score the round-of-16 probabilities the models implied in their per-stage distributions.
- Quarter-finals: start tracking which model has the best surviving picks.
- Semis and final: winner and third place resolve here. Score the champion distribution against the actual winner with a log score.
- Post-mortem: full scorecard, what each model got right, where the consensus was most wrong, and one summary observation on whether the four-model panel beat a market-implied baseline.
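The log score for the champion distribution is the natural log of the probability a model gave to the eventual winner. A minimal sketch; the `floor` parameter is an assumption added so a team forecast at exactly zero does not produce negative infinity.

```python
import math


def log_score(champion_probs: dict[str, float], winner: str, floor: float = 1e-9) -> float:
    """Log of the probability assigned to the actual winner; closer to 0 is better."""
    return math.log(max(champion_probs.get(winner, 0.0), floor))
```

Under this score a model is rewarded only for the mass it put on the actual champion, so a forecast that spreads probability thinly across the long tail pays for it if a favourite wins.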
The pre-tournament forecasts ship today (8 June 2026). The first group-stage update lands shortly after 27 June.