[Figure: plot of human score and LLM judge score, both on a 0 to 5 scale, with a y = x reference line. Selectable slices: all 100 items, long responses, short responses, position A, position B, domain: math, domain: trivia.]
Items in slice: 100
Percent agreement: 74%
Cohen's κ: 0.61
Overall sample.
The aggregate kappa looks healthy. Slice by length or position to see where that aggregate hides bias.
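
As a rough illustration of the slicing idea, the sketch below computes percent agreement and Cohen's κ for a whole sample and then per slice. It assumes unweighted κ over scores treated as categories; the source does not say whether the figure uses a weighted variant. The function names, the `human`/`judge`/`length` keys, and the toy data are all hypothetical and chosen only for the example.

```python
from collections import Counter


def percent_agreement(human, judge):
    """Fraction of items where the two raters gave the same score."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty score lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human, judge):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(human)
    p_o = percent_agreement(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same category
    # if each drew independently from their own score distribution.
    p_e = sum(h_counts[c] * j_counts[c] for c in h_counts) / (n * n)
    if p_e == 1.0:
        return float("nan")  # both raters used a single identical category
    return (p_o - p_e) / (1 - p_e)


def kappa_by_slice(items, key):
    """Group items by items[i][key] and compute kappa within each group."""
    groups = {}
    for item in items:
        groups.setdefault(item[key], []).append(item)
    return {
        name: cohens_kappa([i["human"] for i in group],
                           [i["judge"] for i in group])
        for name, group in groups.items()
    }


# Toy data (made up): the judge agrees on short responses only.
toy_items = [
    {"human": 1, "judge": 1, "length": "short"},
    {"human": 0, "judge": 0, "length": "short"},
    {"human": 1, "judge": 0, "length": "long"},
    {"human": 0, "judge": 1, "length": "long"},
]
overall = cohens_kappa([i["human"] for i in toy_items],
                       [i["judge"] for i in toy_items])
per_slice = kappa_by_slice(toy_items, "length")
```

On this toy data the overall κ is 0.0 while the slices come out at 1.0 (short) and -1.0 (long), which shows how a single aggregate number can mask opposite behaviour in subgroups.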