Stage 1: Rubric
  Vague wording
  No tie-breaker
↓
Stage 2: Human labels
  Fatigue
  Drift over time
  Annotator background
↓
Stage 3: LLM judge
  Leniency
  Position bias
  Length bias
  Self-preference
↓
Stage 4: Leaderboard
  One seed per model
Reported ranking
No failure modes toggled. The reported ranking matches the ground-truth ranking.