
Breaking Down Agent Evals (Part 1B): Eval Calibration

A primer on eval calibration: what it means for your scoring pipeline to be trustworthy, the four levels (rubric, human-to-human, LLM-to-human, LLM-to-LLM), the common biases that turn a good-looking dashboard into a fiction, and how to read Cohen's kappa without the textbook. Built around small interactive applets.

Part 1B of the agent-evals series. Part 1 covered the conceptual frame. Part 1A showed the code skeleton. This post is the missing middle: how to know that the numbers your suite is reporting actually mean what you think they mean.


What an eval is

An eval is a triple: a set of inputs, a model that produces outputs from them, and a scoring function that decides whether each output was good. You feed inputs through the model, you score the outputs, you get a number. That number is your eval.

The scoring function can take several shapes. Exact match (the answer is “Paris” or it isn’t). Regex (matches /^[0-9]{4}-[0-9]{2}-[0-9]{2}$/ or doesn’t). A learned classifier with a decision threshold. A human grader applying a rubric. An LLM applying a rubric. Real eval suites use several of these in combination.
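To make the triple concrete, here is a minimal sketch in Python. `run_model` stands in for whatever produces outputs (an API call, a local model), and the scorer names are illustrative, not from any particular library.

```python
import re
from typing import Callable

# A scorer takes (output, reference) and decides pass/fail.
Scorer = Callable[[str, str], bool]

def exact_match(output: str, reference: str) -> bool:
    return output.strip() == reference.strip()

def iso_date_format(output: str, reference: str) -> bool:
    # Reference is unused: this scorer only checks the shape of the output.
    return re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", output.strip()) is not None

def run_eval(inputs: list[str], references: list[str],
             run_model: Callable[[str], str], scorer: Scorer) -> float:
    """Feed inputs through the model, score the outputs, return the pass rate."""
    outputs = [run_model(x) for x in inputs]
    passes = [scorer(out, ref) for out, ref in zip(outputs, references)]
    return sum(passes) / len(passes)
```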

The eval is only as trustworthy as the weakest link. A perfectly assembled benchmark with a noisy scoring function is a noisy eval. A perfect scoring function applied to a vague rubric is a vague eval. The rest of this post is about what happens when the scoring function is itself fallible, and how to know whether yours is.

What “the judges” do

When the scoring function is a person (or an LLM acting as one), we call them a judge. The judge takes an input, a candidate response, an optional reference answer, and a rubric, and returns a score or a label. Standard judge designs include binary correct/incorrect, ordinal 1-5 scores, pairwise preference between two candidates, and free-form critique with an attached score.

The design of the judge itself is a research problem. Pairwise comparison (“which of these two responses is better?”) is generally more reliable than absolute scoring (“rate this response 1 to 5”). Reference-based grading is more reliable than reference-free. Judges can be human or LLM; the principles below apply to both, with LLM-specific failure modes layered on top.
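As a sketch of what a pairwise, reference-based judge can look like in code. The prompt wording and the `call_llm` client are placeholders, not a prescribed design; `call_llm` is assumed to take a prompt string and return the model's text.

```python
from dataclasses import dataclass

@dataclass
class JudgeVerdict:
    winner: str       # "A" or "B"
    rationale: str

PAIRWISE_PROMPT = """You are grading two candidate responses against a rubric.

Rubric:
{rubric}

Question:
{question}

Reference answer (may be empty):
{reference}

Response A:
{response_a}

Response B:
{response_b}

Which response better satisfies the rubric? Answer "A" or "B" on the first line,
then give a one-paragraph rationale."""

def pairwise_judge(question, response_a, response_b, rubric, reference="",
                   call_llm=None) -> JudgeVerdict:
    """Pairwise, reference-based judge. The parsing is deliberately naive."""
    prompt = PAIRWISE_PROMPT.format(rubric=rubric, question=question,
                                    reference=reference,
                                    response_a=response_a, response_b=response_b)
    text = call_llm(prompt)
    first_line, _, rest = text.partition("\n")
    winner = "A" if "A" in first_line.upper().split() else "B"
    return JudgeVerdict(winner=winner, rationale=rest.strip())
```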

What calibration is

Calibration is the process of checking and adjusting your scoring pipeline so that its outputs reflect what your rubric actually says about quality. It is not the same thing as consistency. A judge that says “5/5” on every input has high agreement with itself, but it has no calibration: it isn’t tracking quality, it’s tracking nothing.

Calibration applies at every link in the chain. The rubric is calibrated against itself (do two careful readers interpret it the same way?). The human labellers are calibrated against the rubric (do they agree with each other when grading the same examples?). The LLM judge is calibrated against the humans. And the LLM judge is calibrated against itself across runs and against sibling models.

The output of a calibration pass is one of two things. Either confidence that the eval can be trusted to rank candidate models against each other and against a target. Or a list of changes to make to the pipeline before the number on the dashboard means anything.

Why calibration matters: the chain of trust

The reason you cannot skip this is that every link in the eval chain has a characteristic failure mode, and every one of those failures turns the dashboard number into a number you can’t act on.

The realistic failure modes, one per link in the chain:

- The rubric is vague enough that two careful readers grade the same response differently.
- The human labels are noisy: labellers tire, drift, and bring different standards to the same rubric.
- The LLM judge carries systematic biases (length, position, leniency) that the humans don’t share.
- The reported number aggregates over slices where the judge is wrong, so the dashboard hides exactly the regressions you care about.

Any one of these on its own is recoverable. Several of them stacked is how you ship an evaluation system that confidently ranks the worst model first.

The applet below makes this concrete. The four boxes are the four stages of the eval chain: rubric, human labels, LLM judge, reported leaderboard. Each stage has toggles for the realistic failure modes that show up at that stage. The leaderboard on the right reranks as you flip toggles. Try clicking three or four.

With no toggles, the leaderboard is the ground-truth ranking. Two or three toggles in and Model E is winning, even though Model E was originally fifth out of five. None of these failure modes are exotic; every one of them shows up in production eval pipelines that don’t have calibration discipline.

The four levels of calibration

Calibration is a ladder. Each level rests on the one before it being trustworthy.

A failure at Level 1 makes Level 2 meaningless. A failure at Level 2 makes Level 3 meaningless. You climb the ladder in order.

Level 1: rubric calibration

The rubric is the hidden assumption that everything else grades against. If the rubric is vague, every downstream signal is noisier than it has to be, and you’ll mistake that noise for model variance.

The test is cheap: hand the rubric to two careful humans cold, ask them to grade the same 20 examples, and look at the disagreements. If they disagree on more than a couple, the rubric is the bug, not the labellers. Then fix the patterns: replace vague adjectives like “clear” or “helpful” with operationalised criteria (“answers the user’s question in fewer than 80 words” rather than “is concise”), add positive and negative examples for the borderline cases, and force a tie-breaker rule so a labeller never has to invent one in the middle of a batch.
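The whole check fits in a few lines. A sketch of a hypothetical helper that surfaces the disagreements to read by hand:

```python
def rubric_disagreements(labels_a: dict[str, str], labels_b: dict[str, str]) -> list[str]:
    """Given two labellers' grades keyed by example id, return the ids they disagree on.
    On a 20-example pilot, read every one of these before blaming the labellers."""
    shared = labels_a.keys() & labels_b.keys()
    return sorted(i for i in shared if labels_a[i] != labels_b[i])

# Illustrative labels:
# rubric_disagreements({"ex1": "pass", "ex2": "fail"}, {"ex1": "pass", "ex2": "pass"})
# -> ["ex2"]
```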

Level 2: human-to-human calibration

This is the ceiling. No LLM judge can be measurably more reliable than the humans you compared it against. If your humans agree with each other 65% of the time and the LLM judge agrees with humans 60% of the time, the LLM judge is near ceiling, not failing.

The standard measurement is inter-annotator agreement (IAA). Two common forms: percent agreement (the simple “what fraction of items did they label the same way”) and Cohen’s kappa (the same comparison, corrected for the agreement you’d get by chance alone). Kappa is the one to report, because it is the one that exposes the chance-agreement problem behind the kappa paradox (more on that below).

Practical recipe: three labellers, 100 examples, compute pairwise kappa between each pair. Look at the ceiling (the highest pairwise kappa across the three pairs) and the worst-case pair. If the worst pair is below 0.6, you don’t yet have a trustworthy human signal, and going to Level 3 will produce numbers that don’t mean anything.
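A sketch of that recipe using scikit-learn's `cohen_kappa_score`; each rater's label list is assumed to be aligned item-by-item with the others.

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def pairwise_kappas(labels_by_rater: dict[str, list[str]]) -> dict[tuple[str, str], float]:
    """Pairwise Cohen's kappa for every pair of raters.
    labels_by_rater[r][i] is rater r's label for item i."""
    return {
        (a, b): cohen_kappa_score(labels_by_rater[a], labels_by_rater[b])
        for a, b in combinations(sorted(labels_by_rater), 2)
    }

# kappas = pairwise_kappas({"alice": a_labels, "bob": b_labels, "chen": c_labels})
# ceiling = max(kappas.values()); worst = min(kappas.values())
# If worst < 0.6, fix the rubric or the workflow before buying more labels.
```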

Worth noting that not all Level 2 disagreement is a rubric problem. Even with a perfect rubric, trained labellers disagree because they get tired, bring different domain knowledge, read with different care, and drift in their standards over a week. Level 1 tests interpretation; Level 2 tests application. The rubric can be perfect and the workflow can still be noisy.

Level 3: LLM-to-human calibration

The headline calibration. The goal is straightforward: the LLM judge behaves like a competent human grader on your task. The practical loop is also straightforward, and most teams skip it.

Sample N items from your eval. Have humans label them. Have the LLM label the same N items. Compute LLM-to-human kappa per item type, per response length bucket, per category. Read the disagreement examples by hand: they are the gold you’ll feed back into prompt iteration. Update the judge prompt or the rubric. Repeat.
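A minimal version of the sliced comparison, assuming each item record carries the human label, the LLM label, and a slice key of your choosing (run it once per dimension you care about):

```python
from collections import defaultdict
from sklearn.metrics import cohen_kappa_score

def sliced_kappa(items: list[dict]) -> dict[str, float]:
    """Each item is assumed to look like
    {"human_label": ..., "llm_label": ..., "slice": "long"}.
    Returns LLM-to-human kappa per slice, plus the aggregate."""
    by_slice = defaultdict(lambda: ([], []))
    for it in items:
        humans, llms = by_slice[it["slice"]]
        humans.append(it["human_label"])
        llms.append(it["llm_label"])
    result = {name: cohen_kappa_score(h, l) for name, (h, l) in by_slice.items()}
    result["ALL"] = cohen_kappa_score(
        [it["human_label"] for it in items],
        [it["llm_label"] for it in items],
    )
    return result
```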

A caveat worth memorising, from Thakur et al.: high aggregate kappa does not mean low bias. The LLM judge might agree with humans on most items but systematically err on a specific slice. The scatter widget below is exactly this: the same 100-item sample, sliced by length, by position, by domain. Aggregate kappa looks healthy; click “Long responses” and watch the cloud lift above the diagonal. That’s the bias the aggregate number was hiding.

The “All 100 items” slice has a kappa that would look fine in a release report. The “Long responses” slice and the “Math domain” slice tell two different stories about where the judge is wrong, in opposite directions. Always slice.

Level 4: LLM-to-LLM calibration

The cheapest level, and the one to run first as a sanity check before paying for any human labels. Even when an LLM judge agrees with humans on average, it may be inconsistent with itself. The same prompt at temperature greater than zero produces different scores across runs. Sibling models (the same family at different sizes, or different families on the same task) sometimes disagree wildly.

There are no humans in the loop here. You take a fixed sample of 100 items, run each through several judge models at several temperatures, and compute the pairwise kappa matrix. If a judge disagrees with itself across temperature settings, you can’t trust it to rank candidates stably. If two judges from different families disagree more than the judges agree with themselves, you’re choosing your eval result by your choice of judge model, which is not what you want.
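A sketch of the collection step, with `run_judge` standing in for your judging call and the configuration names purely illustrative. The resulting label lists feed straight into the same pairwise-kappa helper used at Level 2.

```python
def collect_judge_labels(items, judge_configs, run_judge):
    """Run every (model, temperature) configuration over the same fixed items.
    `run_judge(model, temperature, item)` is a stand-in for your judging call.
    Running one config twice under two names exposes self-inconsistency."""
    return {
        name: [run_judge(model, temp, item) for item in items]
        for name, (model, temp) in judge_configs.items()
    }

# labels = collect_judge_labels(
#     sample_100,
#     {"judge-a@t0-run1": ("judge-a", 0.0), "judge-a@t0-run2": ("judge-a", 0.0),
#      "judge-a@t1": ("judge-a", 1.0), "judge-b@t0": ("judge-b", 0.0)},  # illustrative
#     run_judge=my_judge_call,
# )
# kappa_matrix = pairwise_kappas(labels)
```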

Cohen’s kappa, in plain language

Most calibration discussions live and die on Cohen’s kappa, and most engineers I’ve worked with have never used it. The textbook formula is short but unilluminating; the intuition is more important than the math.

Percent agreement is the share of items two raters labelled the same way. Cohen’s kappa is the same number, with the agreement-by-chance subtracted off, normalised so 1.0 is perfect agreement and 0 is exactly chance. The formula:

       p_o − p_e
κ  =  ───────────
       1 − p_e

with two quantities to define:

- p_o is the observed agreement: the fraction of items the two raters labelled the same way.
- p_e is the expected agreement: the fraction they would label the same way by chance, computed from each rater’s marginal label distribution.

Read the formula as a ratio. The numerator p_o − p_e is the agreement above chance: how much better than random the raters did. The denominator 1 − p_e is the room they had to do better: the gap between random and perfect. So kappa is the share of that gap they actually closed. κ = 1 means perfect agreement, κ = 0 means exactly as often as random, κ < 0 means worse than random (anti-correlated).
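The formula translated directly into code, computed from the 2x2 confusion matrix between two raters. The example numbers are chosen to reproduce the paradox discussed below: raw agreement above 90%, kappa near zero.

```python
def cohens_kappa_2x2(both_correct: int, a_only: int, b_only: int, both_incorrect: int) -> float:
    """Cohen's kappa from the 2x2 confusion matrix between raters A and B,
    computed exactly as the formula above reads."""
    n = both_correct + a_only + b_only + both_incorrect
    p_o = (both_correct + both_incorrect) / n              # observed agreement
    a_correct = (both_correct + a_only) / n                # how often A says "correct"
    b_correct = (both_correct + b_only) / n                # how often B says "correct"
    p_e = a_correct * b_correct + (1 - a_correct) * (1 - b_correct)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Heavily skewed labels (illustrative counts):
# cohens_kappa_2x2(both_correct=905, a_only=45, b_only=45, both_incorrect=5)
# -> percent agreement 0.91, kappa ~ 0.05
```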

The rough scale most papers use (the conventional Landis and Koch bands):

- Below 0: worse than chance
- 0.00–0.20: slight agreement
- 0.21–0.40: fair
- 0.41–0.60: moderate
- 0.61–0.80: substantial
- 0.81–1.00: almost perfect

These thresholds are conventional, not laws. Treat them as rough guides.

The applet below is the centerpiece of this post. The 2x2 grid is a confusion matrix between two raters labelling items as “correct” or “incorrect”. You can edit any cell or load one of the canned scenarios. Watch percent agreement, expected agreement, and kappa update together. The “kappa paradox” preset is the one to spend time on: percent agreement above 90%, kappa below 0.1.

If you walk away with one idea from this post, walk away with that paradox. Two raters who agree 92% of the time can have a kappa that says they’re not agreeing meaningfully at all.

When kappa lies: the prevalence paradox

The reason the paradox happens is that p_e (chance agreement) depends on the marginal class distribution. If 95% of your items are labelled “correct” in ground truth, two raters who each say “correct” 95% of the time at random will agree about 90% of the time without knowing anything. Kappa then has to subtract that 90% expected agreement from any observed agreement, leaving almost no signal to work with.

The slider below shows this directly. As you push prevalence from 50% toward 99%, percent agreement stays near 95% (two raters who agree 95% of the time on each class). But kappa collapses toward zero.
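A numeric version of the slider, under a toy model I am assuming here (each rater independently reproduces the ground-truth label 95% of the time). The applet's exact parameterisation may differ, but the collapse is the same.

```python
def agreement_and_kappa(prevalence: float, rater_accuracy: float = 0.95):
    """Toy model: each rater reproduces the ground-truth label independently
    with probability `rater_accuracy`. Returns (percent agreement, kappa)."""
    acc = rater_accuracy
    # Observed agreement does not depend on prevalence in this model.
    p_o = acc ** 2 + (1 - acc) ** 2
    # Each rater's marginal rate of saying "correct" does.
    m = prevalence * acc + (1 - prevalence) * (1 - acc)
    p_e = m ** 2 + (1 - m) ** 2
    return p_o, (p_o - p_e) / (1 - p_e)

for prev in (0.50, 0.80, 0.90, 0.95, 0.99):
    p_o, kappa = agreement_and_kappa(prev)
    print(f"prevalence {prev:.2f}: agreement {p_o:.3f}, kappa {kappa:.3f}")
# prevalence 0.50: agreement 0.905, kappa 0.810
# prevalence 0.99: agreement 0.905, kappa 0.144
```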

The practical implication: never report kappa without the class distribution alongside it. A near-zero kappa on a highly skewed dataset is not necessarily damning; a near-zero kappa on a balanced dataset is.

A short list of other agreement metrics worth knowing, without going deep on any:

- Fleiss’ kappa: chance-corrected agreement generalised to more than two raters.
- Krippendorff’s alpha: handles missing labels and different measurement scales.
- Intraclass correlation (ICC): agreement on continuous scores rather than categorical labels.
- Spearman and Kendall rank correlation: agreement on orderings rather than on absolute scores.

Score calibration vs ranking calibration

The choice of metric reveals what you actually care about. There are two distinct goals:

Score calibration. Is the absolute number right? This matters when you’re reporting quality externally (a regulator, a customer-facing dashboard, a model card claiming a specific benchmark percentage). Use kappa or ICC.

Ranking calibration. Is the ordering right? This matters when you’re picking which model or prompt to ship. Use Spearman or Kendall rank correlation.

Thakur et al. found a surprising thing: a “contains” substring match (the dumbest possible string-comparison grader) has worse kappa than GPT-4-Turbo as a judge on their suite, but produces a better leaderboard ranking than most LLM judges. A grader that’s systematically biased but consistent can rank candidates correctly even when its absolute scores are wrong. The reverse can also happen: a well-calibrated absolute scorer can be unstable enough at the top of the leaderboard to produce wrong rankings.

Pick the metric that matches the question. Reporting external numbers? Kappa or ICC. Choosing which model to ship? Rank correlation.
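A small illustration of why the two goals come apart. The pass rates below are made up; the point is that a grader can be far off in absolute terms and still order the candidates perfectly.

```python
from scipy.stats import spearmanr

# Per-model pass rates under a human-anchored grader vs a cheap, systematically
# harsher grader (illustrative numbers).
human_rates = {"A": 0.82, "B": 0.74, "C": 0.69, "D": 0.61, "E": 0.50}
cheap_rates = {"A": 0.60, "B": 0.55, "C": 0.51, "D": 0.44, "E": 0.35}

models = sorted(human_rates)
rho, _ = spearmanr([human_rates[m] for m in models],
                   [cheap_rates[m] for m in models])
print(rho)  # 1.0: the absolute scores are far off, but the ranking is perfect
```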

A short tour of LLM judge biases

The biases worth probing for in any new judge:

Position bias. Pairwise judges over-prefer whichever response appears in position 1. Swap the order and the verdict flips. The widget below makes the failure mode obvious in one click, and a code sketch of the same swap test follows this list.

Length bias. Longer answers score higher even when the substance is unchanged. The fix is to control for length in your eval set (mix of long and short gold responses) and to spot-check with a length-padding test on any judge you’re considering.

Leniency bias. LLM judges say “correct” more often than humans do. The aggregate dashboard goes up because the judge is generous, not because the agent improved. Catching this requires a human anchor on a sample, not just self-comparison.

Self-preference. A judge from family X tends to favour responses from family X. Detecting this needs cross-family judging: have a Claude judge score Claude vs OpenAI responses and an OpenAI judge score the same pair, then compare.

Rubric overload. Weaker judges get worse when given long, detailed rubrics. Counter-intuitive but documented. If you’re using a smaller model as judge for cost reasons, simplify the rubric.
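The position-swap probe promised above, as a sketch. It assumes a `pairwise_judge(question, a, b)` callable that returns "A" or "B" for whichever response it prefers.

```python
def position_swap_flip_rate(pairs, pairwise_judge) -> float:
    """Judge every pair twice, with the candidates in both orders, and measure
    how often the verdict flips. `pairs` is a list of (question, resp_1, resp_2)."""
    flips = 0
    for question, resp_1, resp_2 in pairs:
        forward = pairwise_judge(question, resp_1, resp_2)    # resp_1 in position 1
        backward = pairwise_judge(question, resp_2, resp_1)   # resp_1 in position 2
        # A consistent judge prefers the same underlying response both times.
        prefers_1_forward = forward == "A"
        prefers_1_backward = backward == "B"
        if prefers_1_forward != prefers_1_backward:
            flips += 1
    return flips / len(pairs)
```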

A calibration checklist

A practical version of everything above, to run before your next eval is taken seriously by anyone outside the team.

- The rubric has been handed cold to two people, they graded the same 20 examples, and the disagreements were fed back into the rubric.
- Vague adjectives in the rubric are replaced with operationalised criteria and an explicit tie-breaker rule.
- Pairwise human-to-human kappa is computed on roughly 100 examples, and the worst pair is above 0.6.
- LLM-to-human kappa is computed against those human labels, sliced by length, position, and domain, not just in aggregate.
- The disagreement examples between judge and humans have been read by hand and fed back into the judge prompt or the rubric.
- The judge has been run against itself across temperatures and against at least one sibling model, and the pairwise kappa matrix is stable.
- The judge has been probed for position bias (swap test), length bias (padding test), and leniency (human anchor on a sample).
- Every kappa is reported alongside the class distribution it was computed on.
- The metric matches the question: kappa or ICC for externally reported numbers, rank correlation for shipping decisions.

If you can tick every box, your eval is calibrated enough that the number on the dashboard can be defended outside the team. If you can’t tick five of them, the number is a guess wearing engineering clothes.

What’s next

The next post in the series, Part 2, walks through τ-bench, the benchmark that crystallised most of these calibration ideas into a single test for tool-using agents.

If you want the broader practitioner take on eval suites (how the layers fit together, what to build first, how to wire it into CI), the standalone eval-suite post is the companion piece. This calibration post is the chapter on grading; that post is the chapter on the suite around the grading.

References

Thakur et al. (2024). Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges. arXiv:2406.12624.

