Break the chain: telephone game

Stage 1: Rubric

Vague wording

No tie-breaker

↓

Stage 2: Human labels

Fatigue

Drift over time

Annotator background

↓

Stage 3: LLM judge

Leniency

Position bias

Length bias

Self-preference

↓

Stage 4: Leaderboard

One seed per model

Reported ranking

No failure modes toggled. The reported ranking matches the ground-truth ranking.
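
The baseline state above (no failure modes toggled, reported ranking equal to ground truth) can be sketched as a small simulation. This is a minimal illustration, not the diagram's actual logic: the model names, scores, response lengths, and the `length_bias` strength are all hypothetical values, chosen so that a single Stage 3 distortion is enough to swap the top two positions.

```python
from typing import Callable, Dict, List

Scores = Dict[str, float]

# Hypothetical ground-truth quality scores (illustrative values only).
TRUE_SCORES: Scores = {"model_a": 0.80, "model_b": 0.75, "model_c": 0.60}

# Hypothetical average response lengths in tokens (illustrative values only).
AVG_LENGTH: Dict[str, int] = {"model_a": 120, "model_b": 400, "model_c": 150}


def length_bias(scores: Scores, strength: float = 0.0005) -> Scores:
    """Stage 3 distortion: the judge adds a bonus proportional to response length."""
    return {m: s + strength * AVG_LENGTH[m] for m, s in scores.items()}


def ranking(scores: Scores) -> List[str]:
    """Order models from highest to lowest score."""
    return sorted(scores, key=lambda m: scores[m], reverse=True)


def reported_ranking(failure_modes: List[Callable[[Scores], Scores]]) -> List[str]:
    """Pass the true scores through each toggled failure mode, then rank."""
    scores = dict(TRUE_SCORES)
    for distort in failure_modes:
        scores = distort(scores)
    return ranking(scores)


baseline = reported_ranking([])             # no failure modes toggled
distorted = reported_ranking([length_bias])  # length bias toggled at Stage 3

print("ground truth:", ranking(TRUE_SCORES))
print("no failure modes:", baseline)
print("with length bias:", distorted)
```

With an empty list of failure modes the reported ranking is identical to the ground-truth ranking. Toggling the hypothetical length bias moves the longest-winded model ahead of the strongest one. Other failure modes from the stages above could be added as further functions with the same `Scores -> Scores` shape.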