[Figure: scatter plot of human score against LLM judge score, both on a 0 to 5 scale, with the y = x reference line.]
Items in slice: 100
Percent agreement: 74%
Cohen's κ: 0.61
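Overall sample. The aggregate kappa looks healthy, but a single number can average over slices where the judge and the humans disagree. Slice by length or position to see where the aggregate hides bias.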
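To make the slicing idea concrete, here is a minimal sketch of how percent agreement and unweighted Cohen's κ could be computed overall and per slice. The function names, the `slices` labels, and the toy scores are all hypothetical and are not the data behind the figure. The sketch also assumes scores are treated as categorical labels; a weighted κ may suit ordinal 0 to 5 scores better.

```python
from collections import Counter
import math


def percent_agreement(human, judge):
    """Fraction of items where both raters gave the same label."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human, judge):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(human)
    p_o = percent_agreement(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: sum over labels of the product of marginal rates.
    labels = h_counts.keys() | j_counts.keys()
    p_e = sum(h_counts[k] * j_counts[k] for k in labels) / (n * n)
    if p_e == 1.0:
        return math.nan  # both raters used one identical label; kappa is undefined
    return (p_o - p_e) / (1 - p_e)


def kappa_by_slice(human, judge, slices):
    """Group items by slice label and compute kappa within each group."""
    groups = {}
    for h, j, s in zip(human, judge, slices):
        hs, js = groups.setdefault(s, ([], []))
        hs.append(h)
        js.append(j)
    return {s: cohens_kappa(hs, js) for s, (hs, js) in groups.items()}


# Hypothetical toy data: binary labels, with a made-up "short"/"long" slice.
human = [1, 0, 1, 0, 1, 0, 1, 0]
judge = [1, 0, 1, 0, 1, 1, 0, 0]
slices = ["short"] * 4 + ["long"] * 4

overall = cohens_kappa(human, judge)
per_slice = kappa_by_slice(human, judge, slices)
print(overall, per_slice)
```

In this toy case the overall κ is 0.5, yet the "short" slice has κ of 1.0 and the "long" slice has κ of 0.0, which is the kind of gap an aggregate figure can conceal.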