Stage 1: Rubric
Vague wording
No tie-breaker
Stage 2: Human labels
Fatigue
Drift over time
Annotator background
Stage 3: LLM judge
Leniency
Position bias
Length bias
Self-preference
Stage 4: Leaderboard
One seed per model
Reported ranking
No failure modes toggled. The reported ranking matches the ground-truth ranking.