
LLMs playing Just One: Why Same-Model LLM Ensembles Mode-Collapse

Ask four Claude Haiku instances independently for a clue for 'toast' and all four reply 'bread'. Four Sonnets collide even more often; four Opuses more often still. I built a tiny benchmark around the board game Just One to measure when LLM ensembles collapse onto the same answer and what makes them stop. The mixed-family ensemble plus an anti-correlation prompt hits 3.25× the single-model baseline's joint hit rate.

Give four independent Claude Haiku instances the same target word, “toast”, and ask each for a one-word clue.

All four say bread.

Give them “salt”: all four say seasoning. “Balloon”: all four say float. “Camera”: all four say lens. “Nail”: all four say hammer.

The board game Just One has a rule that turns this into the most important diagnostic I know for cooperative LLM systems: any duplicate clues are silently removed before the guesser sees them. If all four clue-givers write the same word, the guesser sees nothing. The team scores zero.

I built a tiny benchmark and ran 27 setups over ~2000 rounds to figure out how often this happens, why, and what stops it. The headline: vanilla Haiku × 4 collapses on 57% of rounds. A three-family team (Anthropic + OpenAI + Gemini) plus an anti-correlation prompt closes most of the gap, taking joint hit rate from 20% to 65%, over 3× baseline. The same dynamic shows up anywhere you run parallel LLM workers and hope they disagree usefully.


How Just One works

Just One is a cooperative party game where the team has to get a single guesser to identify a secret word. Each round plays out the same way:

  1. One player at the table is the guesser. They don’t see the secret word.
  2. The secret word is revealed to everyone else (the clue-givers).
  3. Each clue-giver independently writes one single word on a small whiteboard without seeing the others.
  4. Before the guesser is shown anything, the clue-givers reveal their words to each other. Any duplicates (or invalid variants: the target itself, plurals, near-spellings) are silently flipped face-down.
  5. The surviving clues are shown to the guesser. They have one shot.

The team scores +1 per hit, across 13 rounds total.

The interesting bit is step 4. If every clue-giver writes the obvious clue, the team gets zero information. So the optimal strategy is not “give the best clue”, it is “give a clue that is both useful AND something your teammates probably will not pick”. This is exactly the kind of social-coordination pressure that a same-model LLM team is going to struggle with, because every instance shares the same training distribution and the same peaked prior over “obvious clue for target X”.
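A minimal sketch of the kind of filter step 4 implies (lowercasing plus a crude plural check; a real harness would also want near-spelling handling):

```python
from collections import Counter

def normalize(clue: str) -> str:
    """Lowercase and strip a trailing 's' as a crude plural check."""
    c = clue.strip().lower()
    return c[:-1] if c.endswith("s") and len(c) > 3 else c

def dedup(clues: list[str], target: str) -> list[str]:
    """Just One filter: drop the target itself and any clue that collides
    with another clue after normalization. Returns the surviving clues."""
    norm = [normalize(c) for c in clues]
    counts = Counter(norm)
    survivors = []
    for raw, n in zip(clues, norm):
        if n == normalize(target):
            continue              # clue is (a variant of) the target: invalid
        if counts[n] > 1:
            continue              # duplicate: silently removed
        survivors.append(raw)
    return survivors

# e.g. dedup(["bread", "Bread", "butter", "toasts"], "toast") -> ["butter"]
```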

Mapped to an LLM experiment, every round reduces to: N independent calls to clue-giver models, a deterministic dedup function, one call to a guesser model, a yes/no on whether the guess matched. Every round is independent. No game state, no carry-over. A much cleaner unit of measurement than Codenames or Decrypto.
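In code, one round is just this; `give_clue` and `make_guess` are placeholders for whatever client calls you use, and `dedup` is the filter sketched above:

```python
def play_round(target: str, clue_givers: list, guesser) -> dict:
    """One independent Just One round: N clue calls, a dedup pass, one guess."""
    clues = [give_clue(model, target) for model in clue_givers]     # N independent calls
    survivors = dedup(clues, target)                                # deterministic filter
    guess = make_guess(guesser, survivors) if survivors else None   # one guesser call
    return {
        "target": target,
        "clues": clues,
        "survivors": survivors,
        "guess": guess,
        "hit": bool(guess) and guess.strip().lower() == target.strip().lower(),
    }
```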

See a round play out

The widget below shows real rounds from the experiment. Pick a target word, then switch between prompt variants and team compositions to see what each setup actually produced.

The “anchor” target is the cleanest demonstration. Vanilla prompt: four Haiku produce Boat / stability / Boat / Weight. Two collide on “Boat”, two survive, guesser fails to decode “stability + Weight” and answers “balance”. With the ToM-explicit prompt, the same four Haiku produce mooring / Stability / rope / Weight. All four survive, guesser nails “anchor”. A different prompt to the same model rescued the round.

Other rounds are unrescuable. “Camera” goes to lens × 4 under every Haiku prompt I tried. Even with explicit anti-correlation instructions, the LLM cannot find a clue for “camera” that is not “lens”. I call these shell-traps and there is a section on them below.

The setup

27 setups, ~2000 rounds, ~10k API calls. Code, raw JSONL logs, and the full experiment log are in a separate repo; this post focuses on the findings.

Finding 1: vanilla baselines collide hard

Four Haiku, vanilla prompt, T = 1.0: 57% of rounds collapse to zero surviving clues, and the joint hit rate is 20% (the full numbers are in the prompt table under Finding 3).

This is not because Haiku is bad at the task. Conditional hit rate (when at least one clue survives) is 46.5%, so the surviving clues are decodable about half the time. The bottleneck is the duplicate filter, not the guesser.
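For reference, the three rates I lean on throughout (collapse, joint hit, conditional hit) fall out of the round records like this:

```python
def summarize(rounds: list[dict]) -> dict:
    """Per-setup metrics over independent round records (see play_round above)."""
    decoded = [r for r in rounds if r["survivors"]]  # rounds where the guesser saw anything
    return {
        # fraction of rounds where every clue was filtered out
        "collapse": 1 - len(decoded) / len(rounds),
        # fraction of all rounds the guesser got right
        "joint_hit": sum(r["hit"] for r in rounds) / len(rounds),
        # hit rate given that at least one clue survived
        "conditional_hit": sum(r["hit"] for r in decoded) / max(len(decoded), 1),
    }
```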

Finding 2: within Anthropic, frontier models collide more, not less. Across families the picture is messier.

You’d probably guess that a stronger model would give better, more discriminating clues. Within the Anthropic family it goes the other way.

Mode collapse rate by clue-giver model (4 × same model, vanilla prompt):

| Clue-giver model | Family | Mode collapse rate |
| --- | --- | --- |
| Haiku 4.5 | Anthropic | 57% |
| Sonnet 4.6 | Anthropic | 77.5% |
| Opus 4.7 | Anthropic | 83.75% |
| gpt-4o-mini | OpenAI | 78.75% |
| gpt-4o | OpenAI | 53.75% |
| Gemini 2.5 Flash | Google | 27.5% |

Within Anthropic the curve is monotonic. Sonnet collides more than Haiku. Opus collides more than Sonnet. The stronger model has a sharper prior over “obvious clue for X”, which means more agreement when sampled independently, which means more clues get filtered. Joint hit rate falls in lockstep: Haiku 20%, Sonnet 10%, Opus 10%.

Within OpenAI it goes the other way. gpt-4o-mini collides on 78.8% of rounds, roughly tied with Sonnet, but full gpt-4o collapses on only 53.8%, less than the mini. The "frontier collides more" pattern is not universal; it is family-specific. Maybe gpt-4o's training explicitly encourages output diversity in a way the mini variant's does not; I don't have a clean test for it. The honest version of the claim: whether the bigger model in your family collides more or less is an empirical question, and you should not assume either direction without measuring.

Gemini Flash sits well below all five other models at 27.5% collapse, half what Haiku does, a third of Opus. Cross-family priors differ enough that even the smallest, cheapest Gemini collides less than every model in the other two families.

Finding 3: anti-collision prompts help, but only ~10pp

The user-facing question I want to answer head-on: did the prompts actually do anything, or is the model just going to say the same thing no matter what we tell it?

I tested four variants on Haiku × 4 (the CoT-then-clue one broke, see further down):

| Prompt | Survival | Collapse | Joint hit | Conditional hit |
| --- | --- | --- | --- | --- |
| Vanilla | 16.5% | 57.0% | 20% | 46.5% |
| Anti-correlation | 24.0% | 47.0% | 29% | 54.7% |
| Kth-obvious-4 | 21.8% | 43.0% | 30% | 52.6% |
| ToM-explicit | 23.2% | 48.0% | 31% | 59.6% |

The prompts do change the model's behaviour. All three anti-collision framings lift joint hit rate by ~10pp. ToM-explicit also wins the conditional-hit comparison (60%), which means it is not just shuffling around which clue gets picked; the clues that do survive are better on average.

You can see the effect concretely. “Salt” under vanilla: seasoning × 4, all collide, 0 survive. Under ToM-explicit: mineral × 3 + seasoning × 1, three collide on “mineral” but the fourth holds onto “seasoning” and survives. Guesser hits “salt”.

What I did not test: a bare “please be creative” framing. The closest thing I ran is “pick your second or third association”. That was a deliberate choice. Telling a model to “be creative” without grounding what creative means in this game is the kind of vague instruction I would expect to either do nothing or send the model wandering into unhelpful word-associations. The framings I tested ground the goal in the game mechanic (“teammates will pick the obvious clue”, “duplicates are removed”) rather than asking for a vibe. But I cannot rule out that “be creative” would have done as well; it is a fair follow-up.
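For concreteness, the gist of each framing (paraphrased, not the verbatim system prompts; those are in the repo):

```python
PROMPT_VARIANTS = {
    # Paraphrased descriptions of the framings tested above, not the exact prompt text.
    "vanilla": "Give one single-word clue for the secret word.",
    "anti_correlation": (
        "Give one single-word clue. Duplicate clues are removed before the guesser "
        "sees them, so avoid the clue your teammates are most likely to pick."
    ),
    "kth_obvious_4": (
        "Rank the clues you would naturally give from most to least obvious and "
        "submit one from further down the list, as long as it still points at the word."
    ),
    "tom_explicit": (
        "Your teammates see the same word and will probably write the single most "
        "obvious association. Model what they will write, then pick a clue that is "
        "useful but unlikely to collide with theirs."
    ),
}
```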

The other thing worth flagging: the lifts are real but modest. ~10pp on joint hit rate, after writing a careful explanation of the game’s coordination dynamic into the system prompt. The next finding shows a single intervention that does ~3× that lift.

Finding 4: at temperature 0, the benchmark literally cannot function

| Temperature | Survival | Collapse | Joint hit |
| --- | --- | --- | --- |
| T = 0.0 | 0.3% | 98.8% | 0% |
| T = 0.3 | 3.1% | 87.5% | 5% |
| T = 0.7 | 13.8% | 55.0% | 22.5% |
| T = 1.0 | 16.5% | 57.0% | 20% |

At T = 0, 98.8% of rounds collapse to zero survivors. The greedy decode is deterministic given the prompt, and four Haiku instances called with the same system prompt and target word produce literally the same clue almost every single time. The collapse rate at T = 0 isolates the pure effect of the peaked prior, before any sampling noise smooths things out.

This generalises beyond Just One. Any production system where you run parallel LLM workers and want them to disagree usefully (ensemble voting, multi-perspective reasoning, parallel proposal generation, “give me three different takes” patterns) cannot use T = 0 in the diversity-bearing slots. The disagreement you are budgeting for does not exist at greedy decode.
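In configuration terms, temperature is a per-slot decision rather than a per-system one (a sketch; the values are the ones from this benchmark, not universal constants):

```python
# The N parallel clue-giver calls are diversity-bearing and need sampling noise;
# the single downstream guesser call is not, so determinism is fine there.
CLUE_GIVER_TEMPERATURE = 1.0   # T = 0 here reproduces the 98.8%-collapse failure mode
GUESSER_TEMPERATURE = 0.0      # only one sample is taken from this slot
```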

Finding 5: cross-family diversity is the biggest single lever, and adding a third family helps further

This is the headline of the whole study.

| Team composition | Collapse | Joint hit |
| --- | --- | --- |
| Haiku × 4, vanilla | 57.0% | 20% |
| Haiku × 4 + ToM-explicit prompt | 48.0% | 31% |
| Hetero Anthropic (2 Haiku + 2 Sonnet), vanilla | 56.2% | 25% |
| Two-family mixed (Haiku + Sonnet + 2× Gemini), vanilla | 30.0% | 46.2% |
| Two-family mixed + anti-correlation | 6.2% | 58.8% |
| Three-family mixed (Haiku + gpt-4o-mini + 2× Gemini), vanilla | 27.5% | 35.0% |
| Three-family mixed + anti-correlation | 10.0% | 65.0% |

Adding Gemini and OpenAI to a Haiku team, paired with the anti-correlation prompt, gets you 65% joint hit rate, 3.25× the Haiku baseline, with the mode collapse rate down to 10%. The three-family + anti-correlation setup is the overall winner.

A small wrinkle worth flagging: under vanilla prompting the three-family team does not beat the two-family team. The collapse rates are close (27.5% vs 30.0%), but the three-family team's joint hit rate is lower (35% vs 46.2%). With anti-correlation the three-family team pulls ahead (65% vs 58.8%). The interpretation: just throwing more families in does not automatically help; the prompt does the work of turning that diversity into useful surviving clues. Diversity alone, without the prompt explaining what to do with it, gives you noisier clues that the guesser decodes less reliably.

What does not work: heterogeneity within Anthropic. A team of 2 Haiku + 2 Sonnet only gets a 25% joint hit rate, a 5pp lift over Haiku × 4 that is essentially noise. Haiku and Sonnet are different sizes, but they share enough training distribution that their "obvious clue for X" priors are highly correlated. Diversity has to cross the training-distribution boundary to matter.

The most striking single statistic from the cross-baseline analysis: across the six single-model baselines (Haiku, Sonnet, Opus, gpt-4o-mini, gpt-4o, Gemini Flash), zero of 222 targets collapse on every model. Every single target in the benchmark is rescuable by at least one of the six models. With only the four non-OpenAI models, the universal-collapse set had two members (fog, yacht); adding the two OpenAI models empties it entirely.

The per-model collision attractors tell the same story.

The Anthropic family and gpt-4o-mini share some attractors with Gemini (“flower”, “bird”, “insect”). But gpt-4o has a noticeably different attractor set (“sea”, “ceramic”, “equine”, “plaything”) that suggests its prior parks on more unusual associations. This is the direct empirical reason cross-family ensembles work: the failure surfaces of different families overlap on the obvious attractors but diverge on the less-obvious ones, so blind spots get covered.

Finding 6: a stronger guesser does not rescue weak clues

I tested Haiku × 4 clue-givers with Opus 4.7 as the guesser. Result: 25% joint hit, vs Haiku-as-guesser’s 20%. Only +5pp. The bottleneck is collision pressure on the clue side; the guesser barely matters.

This is a useful negative result for anyone tempted to “fix” ensemble weakness by routing through a stronger judge or aggregator. If the inputs to the judge are systematically thin (because most clues got filtered as duplicates), no amount of judge quality compensates.

Shell-traps: when the prior is too peaked

Across all 27 setups and ~2000 round-records, some targets are basically impossible because the LLM prior is so concentrated on a single canonical clue that the duplicate filter wipes out everyone.

| Target | Dominant clue | Fraction of all clues going to it | Joint hit rate |
| --- | --- | --- | --- |
| turtle | shell | 50 / 52 | 0% |
| kelp | seaweed | 51 / 52 | 0% |
| camera | lens | 44 / 55 | 0% |
| nail | hammer | 42 / 55 | 0% |
| beehive | honeycomb | 41 / 60 | 0% |
| ribbon | bow | 42 / 55 | 0% |
| shark | predator | 37 / 40 | 0% |
| yarn | thread | 34 / 40 | 0% |

For “turtle”, basically every clue across every setup was “shell”. The duplicate filter removed them all. The guesser saw nothing. Per-target hit rate: 0%.

By contrast, easy targets have multiple competing strong clues, so at least one usually survives the filter.

The duplicate-removal rule punishes targets with a single canonical association and rewards targets where the LLM prior has several near-equally-strong modes. For benchmark designers this is the lesson: a benchmark of single-canonical-answer items will systematically underestimate model-team performance versus one of multimodal-prior items. Item selection matters.
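Screening an item pool for shell-traps is cheap: sample a batch of clues per candidate target at T > 0 and check how concentrated they are. A sketch (the 0.8 threshold is arbitrary):

```python
from collections import Counter

def is_shell_trap(clues: list[str], threshold: float = 0.8) -> bool:
    """Flag a target whose sampled clues pile onto one dominant answer.

    `clues` is a bag of clues sampled (at T > 0) for the same target; if a
    single clue accounts for more than `threshold` of the samples, the
    duplicate filter will wipe out almost every round for this target."""
    counts = Counter(c.strip().lower() for c in clues)
    top = counts.most_common(1)[0][1]
    return top / len(clues) >= threshold

# e.g. is_shell_trap(["shell"] * 50 + ["reptile", "slow"]) -> True
```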

One thing I tried that broke: chain-of-thought before the clue

I also wanted to test whether having the model reason briefly (“what would my teammates pick?”) before committing to a clue would help. The prompt asked the model to write up to 30 words of reasoning, then output the final clue on the last line as CLUE: <word>.

It did not work, but for a boring reason: the model rarely emitted the CLUE: marker reliably, and my extractor fell back to grabbing the last token in the response, which was usually a fragment of reasoning (“Three”, “clue”, “likely”) rather than an actual clue. So the “data” for this setup ended up showing 73.5% clue survival (because the reasoning fragments rarely collide) with a 7% hit rate (because they are not real clues).

A clean rerun would need either constrained decoding, a structured JSON output, or strict format-checking with a retry loop. Open for future work; the current data on this variant should be ignored.
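The retry-loop option is the least invasive of the three. Something like this (a sketch; `call_model` is a placeholder for the actual client call):

```python
import re

CLUE_RE = re.compile(r"^CLUE:\s*([A-Za-z\-]+)\s*$", re.MULTILINE)

def clue_with_retry(call_model, prompt: str, max_attempts: int = 3) -> str | None:
    """Ask for brief reasoning plus a 'CLUE: <word>' last line; retry on format
    misses instead of falling back to grabbing the last token of the response."""
    for _ in range(max_attempts):
        response = call_model(prompt)
        match = CLUE_RE.search(response)
        if match:
            return match.group(1)
        prompt += "\n\nYour last answer was missing the final 'CLUE: <word>' line. Try again."
    return None  # treat as an invalid clue rather than a reasoning fragment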

What this means for production LLM systems

The same dynamic shows up anywhere you run parallel LLM workers and hope they disagree usefully. Ensemble voting. Multi-perspective reasoning. Parallel proposal generators. “Give me three different takes” patterns. All of those are betting on diversity the model may not actually be supplying.

Five things from the data worth taking seriously, in rough order of leverage.

Same-model × N is mostly wasted compute. Four parallel Claude instances are not four perspectives. They are four correlated samples from one peaked prior. Where the four agree, one would have been fine. Where you needed them to disagree, they collide. You are paying 4× the tokens for closer to 1× the diversity.

Cross-provider ensembles are the cheapest diversity primitive there is. Swapping two of the four Claude instances for two Gemini instances roughly tripled joint hit rate in this benchmark, and Gemini Flash is cheaper than Haiku anyway. The engineering effort of wiring a second SDK in is small relative to the win. Most teams I have seen do not do this. They should.
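The wiring really is small. A sketch of the shape (model ids are illustrative and SDK surfaces drift, so check the current anthropic, openai, and google-generativeai client docs):

```python
import anthropic
from openai import OpenAI
import google.generativeai as genai

def clue_from_anthropic(prompt: str) -> str:
    msg = anthropic.Anthropic().messages.create(
        model="claude-haiku-placeholder", max_tokens=16, temperature=1.0,
        messages=[{"role": "user", "content": prompt}])
    return msg.content[0].text.strip()

def clue_from_openai(prompt: str) -> str:
    resp = OpenAI().chat.completions.create(
        model="gpt-4o-mini", temperature=1.0,
        messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content.strip()

def clue_from_gemini(prompt: str) -> str:
    model = genai.GenerativeModel("gemini-2.5-flash")
    resp = model.generate_content(prompt, generation_config={"temperature": 1.0})
    return resp.text.strip()

# One ensemble = one callable per slot, crossing provider boundaries.
CLUE_GIVERS = [clue_from_anthropic, clue_from_openai, clue_from_gemini, clue_from_gemini]
```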

Temperature 0 silently destroys diversity. The standard advice is fine for a single call: low for determinism, high for creativity. The moment you have multiple parallel calls and you want them to differ, T=0 hands you back identical outputs. 98.8% collapse in this benchmark, every Haiku producing the same word. If you cannot articulate which of your parallel slots is supposed to be diversity-bearing, you have a bug.

Anti-correlation prompting helps about 10pp, which is real but modest. The model needs the coordination mechanic explained (teammates will pick the obvious clue, duplicates get removed, pick your second association), not a vague “be creative”. Cross-family diversity is roughly 3× the lift, for the same cost. If you are stacking interventions, do diversity first, prompts second.

Aggregates hide trap inputs. The 65% headline sits on top of a long tail of targets where no setup helped. Turtle was shell-trapped under every Haiku-only configuration. Camera was lens-trapped under every Haiku prompt. If your production task has shell-trap inputs (single-canonical-answer cases where every model picks the same thing), ensembling alone will not rescue them. You either route those through a single call, prompt explicitly for “second-best” answers and verify, or admit the failure surface and route around.

Limits and what is next

This study used 4 clue-givers, 222 concrete-noun targets, and six models across three families (Anthropic, OpenAI, Google). The cross-family findings are now triangulated against all three major providers.

The CoT-then-clue variant deserves a clean rerun with constrained decoding. Internal “best-of-K with one model” is the obvious next experiment: ask one model to produce K different clues internally, pick the most-diverse one, and compare against external diversity from K models. The hypothesis is that internal best-of-K is bottlenecked by the same peaked prior, so external still wins, but it is a clean test.

The scratchpad theory-of-mind pattern (an idea for a separate post) is a natural next step on the prompt-engineering side: have the model explicitly write down "what my teammate has seen" before committing to a clue. It is probably most useful in full Decrypto-style multi-round games where the audience model carries information across rounds.

Code, raw JSONL logs, and the full experiment notebook are in a separate repo on my machine. If there is interest I will clean it up and put it on GitHub. The headline data is reproducible end-to-end on Anthropic + OpenAI + Gemini APIs.

If you are building anything that runs multiple LLM calls in parallel and have not measured how much they actually disagree, try the Just One test. Four parallel calls, same prompt, same target. Count how often you get the same answer. If you are above 50%, your “ensemble” is mostly a single sample with extra steps.
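A minimal version of that test (`call_model` wraps whichever single call you already make in parallel):

```python
from collections import Counter

def agreement_rate(call_model, prompt: str, n: int = 4, trials: int = 50) -> float:
    """Fraction of trials where at least two of n parallel calls return the
    same (normalized) answer. Above ~0.5 and your ensemble is mostly a
    single sample with extra steps."""
    collided = 0
    for _ in range(trials):
        answers = [call_model(prompt).strip().lower() for _ in range(n)]
        if Counter(answers).most_common(1)[0][1] > 1:
            collided += 1
    return collided / trials
```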

